Systems and Methods for Ranking Data Visualizations

ABSTRACT

A computer device receives user selection of a plurality of data fields from a data set. The computer device generates a plurality of data visualization options that use a majority of the plurality of data fields. The computer device computes, for each data visualization option of the plurality of data visualization options, a respective score for the respective data visualization option according to a set of ranking criteria. The set of ranking criteria includes a first ranking criterion that is based on values of one or more of the user-selected data fields in the data set. The computer device creates a ranked list of the data visualization options. The ranked list is ordered according to a plurality of computed scores corresponding to the plurality of data visualization options. The computer device presents the ranked list to the user.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 15/908,709, filed Feb. 28, 2018, entitled “Constructing Data Visualization Options for a Data Set According to User-Selected Data Fields,” which is a continuation of U.S. patent application Ser. No. 14/242,857, filed Apr. 1, 2014, entitled “Systems and Methods for Ranking Data Visualizations Using Different Data Fields,” each of which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

The disclosed implementations relate generally to data visualizations and more specifically to ranking alternative data visualizations based on a set of data fields.

BACKGROUND

Data visualizations are an effective way to communicate data. Information visualization uses visual representations of data to aid in human understanding of relationships and patterns in the data. With the proliferation of “big data” there is increasing demand for data analysts familiar with visual analytics, but there is a short supply of such individuals and tools. Making the tools easier to use would enable a larger number of people to take charge of their data questions and produce insightful visual charts.

Some data visualization systems include tools to assist people in the creation of data visualizations, and some systems even make suggestions based on the data types of selected fields. For example, if two quantitative fields are selected, a scatter plot may be recommended. Examples of such systems are described in U.S. Pat. No. 8,099,674, entitled “Computer Systems and Methods for Automatically Viewing Multidimensional Databases,” which is incorporated herein by reference in its entirety.

Some data visualization systems automatically generate marks in a data visualization to represent one or more data fields from a data source. For example, some techniques are described in U.S. patent application Ser. No. 12/214,818, entitled “Methods and Systems of Automatically Generating Marks in a Graphical View,” which is incorporated herein by reference in its entirety.

SUMMARY

Disclosed implementations provide a recommendation engine for data visualizations. The systems take a set of data fields selected by a user and intelligently suggest good visual representations to further the user's analysis. Implementations identify a set of possible data visualizations based on the selected data fields, then rank the identified data visualizations. Some implementations rank data visualizations based on visual aspects of presenting the underlying data values (e.g., clustering, outliers, and image aspect ratio).

With a very large number of potential data visualizations, a good system must present the “better” alternatives first. For example, there may be 10,000 or more alternative data visualizations for a selected set of data fields. It would not be much help to a user if the 10,000 options were listed in a random or arbitrary order. Some implementations rank the alternative data visualizations in a two-part process. First, for each view type (e.g., bar chart, line chart, scatter plot, etc.) the ranking system ranks the alternatives within that view type (e.g., rank all of the alternative bar charts against each other). Second, the system merges the rankings into a single overall ranking.

Disclosed implementations typically use multiple criteria for ranking. Some criteria measure statistical structure in the data (e.g., visual patterns in a visualization such as outliers or clusters). Some criteria measure the similarity of a potential data visualization to previous data visualizations selected by a user (e.g., comparing the level of detail, the x-axis and y-axis for layout of the data, and other visual encodings, such as size or color). Previous selections may be from the same user who is preparing a data visualization now, or from a different user or set of users. Some criteria measure the aesthetic qualities (e.g., aspect ratio) of a potential data visualization. Some criteria use user preferences (e.g., a preference for certain view types or encodings within a view type). Some criteria use aggregate preferences based on the history of multiple users (either for the specific data fields currently selected or more generally). By combining these criteria, the ranking correlates with effectiveness at representing structures in the data and delivering insight to the user. Implementations assign weights to each of the criteria, and typically update the weights based on continued feedback from users (e.g., by comparing the data visualizations selected to the calculated rankings).
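By way of illustration only, the weighted combination of criteria described above could be sketched as a weighted sum. The criterion names, the weight values, and the function `combined_score` are hypothetical and are not taken from the disclosure; other ways of combining criteria are equally possible.

```python
def combined_score(criterion_scores, weights):
    """Weighted sum of per-criterion scores (hypothetical combining rule)."""
    return sum(weights[name] * value for name, value in criterion_scores.items())


# Hypothetical weights and per-criterion scores, each assumed to lie in [0, 1].
weights = {"structure": 0.5, "aesthetics": 0.25, "user_history": 0.25}
scatter = {"structure": 0.8, "aesthetics": 0.4, "user_history": 0.4}
table = {"structure": 0.2, "aesthetics": 0.4, "user_history": 0.8}

# Order the candidate visualizations from highest to lowest combined score.
ranked = sorted(
    [("scatter", scatter), ("table", table)],
    key=lambda item: combined_score(item[1], weights),
    reverse=True,
)
```

In this sketch the scatter plot ranks first because the assumed weights favor statistical structure; a user whose history weight was larger could see the order reversed.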

Disclosed implementations assist users in the cycle of visual analysis. The cycle typically proceeds by selecting a set of data fields, visually representing those data fields in some way, noticing results from the visual representation, and asking follow-up questions. The follow-up questions often lead to more data visualizations, which may drill down, drill up, filter the data, bring in additional data fields, or just view the data in a different way. Creating views of the data can be a slow task, particularly when a user is not familiar with the visual analytic tool or when the task is not clear. For example, it may not be clear to a user what view type to create, what level of detail to select for the data, or what aesthetics would be useful. Disclosed implementations speed up the user's journey to insight by identifying good, analytically useful views of the user-selected data fields and presenting those views in ranked order.

Providing a ranked list of meaningful views of selected data has two main phases. First, a system must identify a set of possible views for the selected set of data fields. This is sometimes referred to as the “generation” phase. Second, the system ranks each of the possible views. This is sometimes referred to as the “evaluation” phase.

Implementations use various criteria in the evaluation phase. For example, some criteria quantify the extent to which a possible data visualization displays some “interesting” structure or pattern that exists in the data. Some interesting structures relate to statistical properties of the selected data fields or relationships between the selected data fields. A particular visual representation is ranked higher when such structures or patterns are visually identifiable. Some criteria apply information visualization best practices to present the data in an aesthetically pleasing and clear manner. As described in more detail below, these criteria and others are applied together to evaluate visual representations for the selected set of data fields.

Some criteria depend heavily on the view type of each data visualization because different view types have different strengths. For example, different view types are better able to represent different types of data, different view types are able to aesthetically represent different amounts of data, and different view types facilitate various analytic tasks. Because of this, some implementations divide the evaluation into two parts: rank the possible data visualizations within each view type, then combine the ranked lists of views of different types together to provide a diverse list of analytically useful views of the selected data fields.
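One possible way to realize the two-part evaluation is sketched below. The disclosure does not specify how the per-view-type lists are combined; the round-robin interleaving used here, the function name `rank_within_then_merge`, and the sample candidates are assumptions chosen only to show how a diverse merged list might be produced.

```python
from collections import defaultdict
from itertools import zip_longest


def rank_within_then_merge(candidates):
    """candidates: iterable of (view_type, name, score) tuples (hypothetical shape)."""
    by_type = defaultdict(list)
    for view_type, name, score in candidates:
        by_type[view_type].append((score, name))

    # Part 1: rank the alternatives within each view type, best first.
    ranked_lists = [sorted(items, reverse=True) for items in by_type.values()]

    # Order the view types by the score of their best alternative.
    ranked_lists.sort(key=lambda items: items[0][0], reverse=True)

    # Part 2 (assumed strategy): interleave round-robin so that the top of the
    # merged list contains a mix of view types.
    merged = []
    for tier in zip_longest(*ranked_lists):
        for entry in tier:
            if entry is not None:
                merged.append(entry[1])
    return merged


example = [
    ("bar", "bar_a", 0.9),
    ("bar", "bar_b", 0.8),
    ("map", "map_a", 0.95),
    ("text", "text_a", 0.3),
]
merged_views = rank_within_then_merge(example)
```

With this strategy the second-best bar chart is listed after the best view of every other type, which trades strict score order for diversity.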

A simple example illustrates typical processes. Consider a set of quantitative data with a geographic component that may be visualized as a text table, a bar chart, or a map. The map is the best at highlighting the geographical distribution, so it is ranked first. The bar chart works well to showcase the overall trend of the quantitative variable and to make more precise relative comparisons of values encoded as bar lengths, so it is ranked next. A text table has the densest display and is good for looking up precise details, but is ranked last. Of course, the ranking could be different based on other criteria, such as a user preference to see data in text tables. One of the advantages of some implementations is providing a unified way to combine various criteria, which can result in different rankings depending on the user, the user's history, historical usage of the data set, current selections by the user, and so on.

In some implementations, the list of meaningful views presented to the user includes views with modified sets of data fields (i.e., the set of data fields is not exactly the set of data fields the user selected). For example, views may include additional data fields, fewer data fields, or replace a selected data field with another data field. In addition, some implementations add or modify filters of the data (e.g., sales data filtered to 2015 may provide more useful information if sales data for 2014 were included as well). Some implementations include these additional views in the same ranked list that includes the views that use exactly the data fields selected by the user. Other implementations place these “complementary” views in a separate ranked list.

When all of the views are presented together, some implementations include criteria for how to interleave the data visualizations. For example, some implementations include a weighting factor based on whether a data visualization uses exactly the data fields selected by the user. For example, a ranking score may be decreased by each modification to the user-selected set of data fields. Note that a really good data visualization based on a modified set of fields may be ranked higher than some average data visualizations that use the exact set of user-selected fields.
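As a minimal sketch of such a weighting factor, the score of a view could be multiplied by a fixed factor for each modification to the user-selected fields. The multiplicative form, the 10% penalty, and the name `adjusted_score` are assumptions made for illustration; the disclosure states only that the score may be decreased by each modification.

```python
def adjusted_score(base_score, num_modifications, penalty=0.1):
    """Reduce a score by a fixed fraction per field-set modification.

    The multiplicative penalty and its default value are hypothetical.
    """
    return base_score * (1.0 - penalty) ** num_modifications


# A strong view on a modified field set can still outrank an average view
# that uses exactly the user-selected fields.
strong_modified = adjusted_score(0.9, num_modifications=1)
average_exact = adjusted_score(0.5, num_modifications=0)
```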

In accordance with some implementations, a method executes at a computing device having one or more processors and memory. The memory stores one or more programs configured for execution by the one or more processors. The computing device receives user selection of a set of data fields from a data set, and identifies a first plurality of data visualizations that use each data field in the user-selected set of data fields. For each of the first plurality of data visualizations, the process computes a respective score based on a set of ranking criteria. At least one ranking criterion used to compute each score is based on visual patterns corresponding to statistical properties of data values of one or more of the user-selected data fields. The process also identifies a second plurality of data visualizations. Each data visualization in the second plurality uses a majority of the user-selected data fields and also uses a respective additional data field, from the data set, that is not in the user-selected set of data fields. For each of the second plurality of data visualizations, the process computes a respective score based on the set of ranking criteria. At least one ranking criterion used to compute each score is based on visual patterns corresponding to statistical properties of data values of the respective additional data field. The process then forms a recommended set of data visualizations, which includes one or more data visualizations, from the first plurality, having high computed scores, and also includes one or more data visualizations, from the second plurality, having high computed scores. The process presents the recommended set of data visualizations to the user.

In some implementations, the process presents the recommended set of data visualizations to the user as a single ranked list, which is ordered according to the computed scores of the data visualizations in the first and second pluralities.

In some implementations, the process presents the recommended set of data visualizations to the user as two ranked lists. The first ranked list comprises high scoring data visualizations in the first plurality, ordered according to corresponding computed scores, and the second ranked list comprises high scoring data visualizations in the second plurality, ordered according to corresponding computed scores.

In some instances, at least one of the second plurality of data visualizations is based on fewer than all of the data fields in the user-selected set of data fields. In some instances, at least one of the second plurality of data visualizations is based on all of the data fields in the user-selected set of data fields.

In accordance with some implementations, a method executes at a computing device with one or more processors and memory to identify and rank a set of potential data visualizations. The method receives user selection of a set of data fields from a set of data and identifies a plurality of data visualizations based on the plurality of user-selected data fields. For each of the plurality of data visualizations, a score is computed based on a set of ranking criteria. A first ranking criterion of the set of ranking criteria is based on values of one or more of the user-selected data fields in the set of data. A first ranked list of the identified data visualizations is created, which is ordered according to the computed scores of the data visualizations. In some implementations, the first ranked list is presented to the user.

In accordance with some implementations, a method executes at a computing device with one or more processors and memory to identify and rank a set of potential data visualizations. A user selects a plurality of data fields from a set of data, and the device identifies a plurality of data visualizations that use a majority of the user-selected data fields. For each of the plurality of data visualizations, the device computes a score based on a set of ranking criteria. A first ranking criterion of the set of ranking criteria is based on values of one or more of the user-selected data fields in the set of data. The device creates a first ranked list of the data visualizations, where the items in the list are ordered according to the computed scores of the data visualizations. In some implementations, the first ranked list is presented to the user. In some implementations, the user selects from the first ranked list and the computing device displays a data visualization corresponding to the user selection.

In accordance with some implementations, a method executes at a computing device with one or more processors and memory to identify and rank a set of potential data visualizations. A user selects a set of data fields from a set of data, and the device identifies a plurality of data visualizations that use each data field in the user-selected set of data fields. In addition, the device identifies a plurality of alternative data visualizations. Each alternative data visualization uses each data field in a respective modified set of data fields. Each respective modified set differs from the user-selected set by a limited sequence of atomic operations (e.g., at most two). Too many changes would lead to an exponential increase in the number of options to evaluate, and those options would deviate further from what the user requested. Examples of atomic operations include: adding a single data field that was not selected by the user; or removing one of the user-selected data fields. For each of the data visualizations and each of the alternative data visualizations, the device computes a score based on a set of ranking criteria. At least one criterion used to compute each score uses values of one or more of the data fields in the set of data (e.g., one of the data fields on which an alternative data visualization is based). Finally, a subset of the highest scoring data visualizations and alternative data visualizations is presented to the user.
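The enumeration of modified field sets can be illustrated with the following sketch, which treats “add one unselected field” and “remove one selected field” as the only atomic operations and expands the user-selected set breadth-first up to a fixed number of operations. The function name `modified_field_sets`, the decision to skip the empty set, and the example field names are assumptions for illustration.

```python
def modified_field_sets(selected, all_fields, max_ops=2):
    """Return field sets reachable from `selected` in 1..max_ops atomic operations.

    An atomic operation adds one unselected field or removes one selected
    field. Empty field sets are skipped (an assumption of this sketch).
    """
    selected = frozenset(selected)
    seen = {selected}
    frontier = {selected}
    for _ in range(max_ops):
        next_frontier = set()
        for fields in frontier:
            for field in all_fields:
                # Toggle membership: remove the field if present, otherwise add it.
                candidate = fields - {field} if field in fields else fields | {field}
                if candidate and candidate not in seen:
                    seen.add(candidate)
                    next_frontier.add(candidate)
        frontier = next_frontier
    return seen - {selected}


one_step = modified_field_sets({"Sales", "Region"}, ["Sales", "Region", "Year"], max_ops=1)
two_steps = modified_field_sets({"Sales", "Region"}, ["Sales", "Region", "Year"], max_ops=2)
```

Bounding `max_ops` keeps the number of candidate sets manageable, consistent with the observation above that too many changes cause rapid growth in the options to evaluate.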

In some implementations, the first ranking criterion scores each respective data visualization according to visual structure of values of one or more of the user-selected data fields as rendered in the respective data visualization. In some implementations, the visual structure includes clustering of data points. In some implementations, the visual structure includes the presence of outliers. In some implementations, the visual structure includes monotonicity of rendered data points (i.e., monotonically increasing, monotonically non-decreasing, monotonically decreasing, or monotonically non-increasing). In some implementations, the visual structure includes striation of a data field, wherein each respective value of the data field is substantially a respective integer multiple of a single base value.
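Two of the structural properties named above, monotonicity and striation, lend themselves to simple checks, sketched below. The function names, the requirement that the caller supply the base value, and the 1% tolerance used to interpret “substantially” are assumptions of this sketch and not details from the disclosure.

```python
def is_monotonic(values, strict=False):
    """True if values never decrease or never increase (strictly, if strict=True)."""
    pairs = list(zip(values, values[1:]))
    if strict:
        return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def is_striated(values, base, tolerance=0.01):
    """True if every value is within `tolerance` (in units of `base`) of an
    integer multiple of `base`. The tolerance is a hypothetical reading of
    "substantially"."""
    return all(abs(v / base - round(v / base)) <= tolerance for v in values)
```

For example, prices recorded in steps of 5 with small rounding noise would pass `is_striated(values, base=5)`, while a value of 7.5 would not.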

In some implementations, the first ranking criterion scores each respective data visualization according to one or more aesthetic qualities of the respective data visualization as rendered using values of one or more of the user-selected fields. In some implementations, the aesthetic qualities include the aspect ratio of the rendered data visualizations. In some implementations, the aesthetic qualities include measuring an extent to which entire rendered data visualizations can be displayed on a user screen at one time in a human readable format.

In some implementations, the first ranking criterion scores each respective data visualization according to visual encodings of one or more of the user-selected data fields. In some implementations, visual encoding of a user-selected data field comprises assigning a size, shape, or color to visual marks according to values of the user-selected data field.

In some implementations, each of the data visualizations has a unique view type that specifies how it is rendered. In some implementations, each of the data visualizations has a view type selected from the group consisting of text table, bar chart, scatter plot, line graph, and map. In some implementations, the first ranking criterion scores each respective data visualization according to the view type of the respective data visualization and the user-selected data fields. In some implementations, the set of ranking criteria is hierarchical, comprising a first set of criteria that ranks view types based on the user-selected data fields, and a respective view-specific set of criteria that ranks individual data visualizations for the respective view type based on the user-selected fields.

In some implementations, the method further includes identifying a plurality of alternative data visualizations based on one or more modifications to the set of user-selected data fields, and for each of the plurality of alternative data visualizations, computing a score based on the set of ranking criteria. In some implementations, the first ranked list includes the plurality of data visualizations and the plurality of alternative data visualizations, and the first ranked list is ordered according to the computed scores of the data visualizations and the computed scores of the alternative data visualizations. In some implementations, the method further includes creating a second ranked list of the alternative data visualizations, where the second ranked list is ordered according to the computed scores of the alternative data visualizations. The first and second ranked lists are presented to the user. In some implementations, the modifications include adding one or more additional data fields to the set of data fields. In some implementations, the modifications include removing one or more data fields from the set of data fields. In some implementations, the modifications include replacing a first user-selected data field with a different data field that is hierarchically narrower than the first user-selected data field. In some implementations, the modifications include replacing a first user-selected data field with a different data field that is hierarchically broader than the first user-selected data field. In some implementations, the modifications include applying a filter to the user-selected data fields, wherein the filter was not selected by the user. In some implementations, the modifications include modifying a user-selected filter.

In accordance with some implementations, a method executes at a computing device with one or more processors and memory to generate and rank a set of potential data visualizations. The method receives user selection of a set of data fields from a set of data and generates a plurality of data visualization options. Each data visualization option associates each of the user-selected data fields with a respective predefined visual specification feature. For each of the generated data visualization options, the computing device calculates a score based on a set of ranking criteria. A first ranking criterion of the set of ranking criteria is based on values of one or more of the user-selected data fields in the set of data. The computing device creates a ranked list of the data visualization options, where the ranked list is ordered according to the computed scores of the data visualization options. The data visualization options in the ranked list are presented to the user. In some instances, the user makes a selection from the ranked list, and the computing device displays a data visualization on the computing device corresponding to the user selection.

In some implementations, the computation of scores for one or more of the data visualizations uses historical data of data visualizations previously created for the set of data. For example, the historical usage of the set of data may favor certain types of data visualizations or certain types of encodings. For example, an organization may use a specific color encoding for divisions or departments. As another example, users of the data set may prefer stacked bar charts. Historical usage data can identify features that are preferred by users of the data, as well as those features disfavored (e.g., if a certain numeric field has never been used for a size encoding, then it would probably not make a good recommendation). Historical information about usage can be particularly valuable when the usage is unusual for the set of data. Historical usage information can also be applied at a more abstract level, and creates “best practice” heuristics when historical usage information is not available for a specific data source.

In addition to historical data about how a particular data set has been used, some implementations use historical information about the data visualizations a specific user has selected. For example, if a certain user has favored line graphs for visualizations based on various data sources, then line graphs would be more highly recommended when appropriate. As another example, another user may consistently use color encodings, and thus use of color is a good suggestion. On the other hand, for a user who never (or rarely) uses color encodings, a color encoded data visualization would not be a good recommendation. Historical data can also identify preferences for certain data visualization variants. For example, a user may consistently create bar charts with horizontal bars, and thus when bar charts are ranked, horizontal bars would be ranked higher. The historical data used in the ranking of potential new data visualizations can come from various sources. First, there is historical data of data visualizations previously selected by the user. Second, there is historical data showing how a user ranked or compared previous data visualizations. For example, suppose the ranking system previously presented a user with a set of data visualization options for a data source. When the user selects a specific option, the user has implicitly ranked that option higher than the other options that were presented. Some implementations seek specific ranking feedback, particularly for new users. For example, if five data visualization options are presented, the user may be asked to rank them from 1 to 5. Whether ranking information is collected explicitly or implicitly, it can be used in future ranking calculations. In some implementations, a user's data visualization history is included in a user profile or set of user preferences. In some implementations, user preferences can be identified either through historical usage, from explicit user selection, or both. In particular, a user can specify which types of data visualization or features are preferred or disfavored. Subsequent ranking can use the preferences to compute scores for one or more of the data visualizations.

In some implementations, the method further includes receiving user selection of a filter that applies to a first user-selected data field, where the filter identifies a set of values for the data field and the data visualizations are based on limiting values of the data field to the set of values. In some implementations, the set of values is a finite set of discrete values. In some implementations, the set of values is an interval of numeric values.

In some implementations, a first data visualization of the data visualizations applies a filter to a user-selected data field, thereby limiting the values of the user-selected data field to a first set of values, where the filter is not selected by the user.

In some implementations, the method further includes receiving user specification of one or more visual layout properties for layout of a data visualization that includes the user-selected data fields, where the set of ranking criteria includes a second ranking criterion that measures an extent to which a data visualization of the plurality of data visualizations is consistent with the user-specified visual layout properties.

In some implementations, the method further includes receiving user specification of a single view type and the plurality of data visualizations are identified according to the user-specified single view type.

In accordance with some implementations, a computer system has one or more processors and memory. The memory stores one or more programs. The one or more programs are configured for execution by the one or more processors, and the one or more programs comprise instructions for performing any of the methods described herein.

In accordance with some implementations, a non-transitory computer readable storage medium stores one or more programs configured for execution by a computer system having one or more processors and memory. The one or more programs comprise instructions for performing any of the methods described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a context for a data visualization ranking process in accordance with some implementations.

FIG. 2 is a block diagram of a computing device in accordance with some implementations.

FIG. 3 is a block diagram of a data visualization server in accordance with some implementations.

FIG. 4 illustrates the overall process flow for identifying and ranking data visualizations in accordance with some implementations.

FIG. 5 illustrates a process flow for ranking data visualizations in accordance with some implementations.

FIGS. 6A and 6B illustrate various ways that a user-selected set of data fields may be modified in order to expand the set of possible data visualizations.

FIGS. 7A and 7B illustrate two alternative data visualizations that have different aspect ratios.

FIGS. 8A and 8B illustrate two alternative bar graphs with different aesthetic properties.

FIGS. 9A, 9B, and 9C illustrate three scatter plots using various combinations of two numeric variables.

FIGS. 10A and 10B illustrate two maps that encode data in different ways.

FIGS. 11A and 11B illustrate clustering and outliers in scatter plot diagrams.

FIGS. 12A and 12B illustrate some structural patterns in line charts.

FIG. 13 illustrates a screen showing a ranked list of data visualizations in accordance with some implementations.

FIG. 14 illustrates a data visualization history log in accordance with some implementations.

FIG. 15 illustrates a data visualization ranking log in accordance with some implementations.

FIGS. 16A and 16B illustrate how quantitative data fields can be rearranged in accordance with some implementations.

FIGS. 17A-17C provide a flowchart of a process, performed at a computing device, for generating and ranking data visualizations in accordance with some implementations.

FIGS. 18A-18D provide a flowchart of another process, performed at a computing device, for generating and ranking data visualizations in accordance with some implementations. Some implementations combine the process in FIGS. 18A-18D with the process in FIGS. 17A-17C.

FIGS. 19A-19D provide a flowchart of another process, performed at a computing device, for generating and ranking data visualizations in accordance with some implementations. Some implementations combine the process in FIGS. 19A-19D with the processes in FIGS. 17A-17C and/or 18A-18D.

Like reference numerals refer to corresponding parts throughout the drawings.

Reference will now be made in detail to implementations, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that the present invention may be practiced without these specific details.

DESCRIPTION OF IMPLEMENTATIONS

Implementations of a data visualization ranking system typically have two phases. In the first phase (“generation”), the system constructs instances of view types that are appropriate visual representations for the selected set of data fields. In some implementations, alternative modified sets of data fields are used to build supplemental views (e.g., a superset or subset of the user-selected data fields). In the second phase (“evaluation”), the system ranks the data visualizations so that a smaller number of the best options are presented to the user. Presenting alternative views of data to analytic users facilitates their data exploration and increases the likelihood that they find relevant, useful views that help answer their data questions more quickly or effectively than constructing alternative data visualizations manually.

The generation phase typically follows one of three paths: (1) generate all possible views based on the selected set of data fields; (2) generate all possible views, then cull to a smaller set using a simplified evaluation process; or (3) generate a set of “representative” good views. Using all views may better guarantee finding the best option, but the cost of evaluating all options is typically too high for the computing devices that are widely available.

For large data sets, some implementations have a two-phase approach. In the first phase, the system identifies a sample of the data from the data source (e.g., 5% or 10% of the rows), and proceeds to identify a set of good data visualizations based on the sample. In the second phase, the full set of data is used, but the data visualization options are limited to the ones that scored sufficiently high in the first phase. One skilled in the art recognizes that there are various ways to select the sample data, such as a random sample, the first n rows for some positive integer n, or every nth row for some positive integer n.
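The two-phase approach for large data sets might be sketched as follows, using a random sample in the first phase. The function name `two_phase_rank`, the `score` callback signature, and the use of a fixed shortlist size (`keep`) in place of a score threshold are assumptions made for this illustration.

```python
import random


def two_phase_rank(rows, options, score, sample_fraction=0.1, keep=3, seed=0):
    """Shortlist options on a sample of rows, then rank the shortlist on all rows.

    `score(option, data)` is a caller-supplied scoring function (hypothetical).
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(rows) * sample_fraction))
    sample = rng.sample(rows, sample_size)

    # Phase 1: score every option against the sample and keep the best few.
    shortlisted = sorted(options, key=lambda o: score(o, sample), reverse=True)[:keep]

    # Phase 2: rescore only the shortlisted options against the full data.
    return sorted(shortlisted, key=lambda o: score(o, rows), reverse=True)
```

The first n rows or every nth row could be substituted for `rng.sample` without changing the rest of the sketch.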

When all possible visual representations of the selected set of data fields are evaluated, there is an exponential number of options for mapping each of the data fields to visual encodings. In addition, some of the encodings can accept multiple data fields (e.g., the data fields used to define the X-position and Y-position of graphical marks in the display), so there are additional permutations of the data fields for these encodings (e.g., the order of fields used to specify the X-position or Y-position of graphical marks). Each permutation produces a different data visualization based on the ordering of data fields. In some implementations, the complete set is generated, then subsequently culled. Because only the top options will be presented to the user, many data visualization options can be culled with only limited analysis. For example, a quantitative field with a negative value would not be appropriate for size encoding, so that feature is excluded. Similarly, the cardinality of an ordinal field influences how it can be used effectively, as described in examples below. For example, if the cardinality is too large, then it would not be a good choice for color encoding or as an innermost field that defines the X-positions and Y-positions of graphical marks.
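The inexpensive culling rules mentioned above can be illustrated with a small filter over candidate encodings. The encoding names, the function `allowed_encodings`, and the cardinality threshold of 20 are hypothetical; the disclosure does not state a specific threshold.

```python
def allowed_encodings(field_type, values, max_color_cardinality=20):
    """Return the encodings still worth evaluating for one data field.

    The starting set of encodings and the threshold are assumptions.
    """
    encodings = {"x", "y", "color", "size"}

    # A quantitative field with any negative value is not suitable for size.
    if field_type == "quantitative" and any(v < 0 for v in values):
        encodings.discard("size")

    # An ordinal field with too many distinct values is a poor fit for color.
    if field_type == "ordinal" and len(set(values)) > max_color_cardinality:
        encodings.discard("color")

    return encodings
```

Because each rule inspects only a single field, options can be removed before any full visualization is constructed or scored.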

Some implementations generate a limited set of good visual representations of the data fields to significantly reduce the number of possible data visualizations evaluated. In some implementations, this uses mapping rules based on data type semantics and effectiveness of certain visual encodings to identify appropriate view type representations. For example, a certain set of data fields may be best represented as a map chart or scatter plot diagram, so only these two view types are pursued (e.g., excluding bar charts, line charts, and text tables). Subsequently, specific instances of each selected view type are identified, typically by applying information visualization best practices.

A brute force generation process iterates over all possible mappings of the selected set of data fields onto all visual encodings (e.g., X-position, Y-position, color, size, shape, and level of detail). If there are m visual encodings and k selected data fields, there are m^k such mappings. As noted above, some encodings can handle multiple data fields and produce different visual representations based on the order, so the actual number is higher than m^k. For example, the X-position can represent multiple fields (e.g., “dimensions”) where the order of the data fields determines the nesting order of panes or partitions in the view. This large set of alternatives can be culled to produce a set of visualizations that represent best practices in information visualization and perception. Some of these best practices include applying principles of effectiveness in visual representation that favor mapping data fields of certain types to certain encodings. This process can eliminate some bad visual representations quickly. For example, a line chart without a temporal dimension is typically not useful. Another best practice that produces good views is to use low cardinality categorical dimensions for color and shape encodings because a user can easily distinguish a small number of different colors or shapes. A “categorical” data field is a data field with a limited number of distinct values, which categorize the data. For example, a “gender” data field is a categorical data field that may be limited to the two values “Female” and “Male” or “F” and “M”.
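As an illustration of the brute force enumeration and a quick first cull, consider the following sketch. It is hedged in several ways: the encoding names, the `field_info` dictionary layout, and the cardinality cutoff of ten are assumptions made for the example, and the sketch ignores the additional orderings that arise when one encoding holds multiple fields.

```python
from itertools import product

# Hypothetical labels for the six encodings listed in the text.
ENCODINGS = ["x", "y", "color", "size", "shape", "detail"]

def all_mappings(fields, encodings=ENCODINGS):
    """Yield every assignment of each field to one encoding (m**k in total)."""
    for combo in product(encodings, repeat=len(fields)):
        yield dict(zip(fields, combo))

def cull(mappings, field_info, max_cardinality=10):
    """Drop mappings that violate two simple rules described in the text."""
    for mapping in mappings:
        ok = True
        for field, encoding in mapping.items():
            info = field_info.get(field, {})
            # A field with negative values is not appropriate for size encoding.
            if encoding == "size" and info.get("min", 0) < 0:
                ok = False
            # High-cardinality fields are poor choices for color or shape.
            if encoding in ("color", "shape") and info.get("cardinality", 0) > max_cardinality:
                ok = False
        if ok:
            yield mapping
```

With six encodings and two fields, `all_mappings` yields 6^2 = 36 candidates, which shows how quickly the count grows as fields are added.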

Some implementations use a constrained generation algorithm. These implementations use information visualization effectiveness principles that determine the set of view types that create appropriate visual representations of a particular set of data fields. Once specific view types are selected, good instances of each applicable view type are created. Applying a set of rules (e.g., codifying best practices in information visualization and graphic design), the system maps the data fields to visual encodings. This constrains the set of alternatives within each view type. For example, categorical data fields with small cardinality may be mapped to color or shape encodings.

Within a single view type, alternative data visualization instances are generated in several ways. In some instances, alternative views are generated by changing the order of data fields that define the X-positions and Y-positions of graphical marks, which affects not just the axes but also the level of breakdown in the creation of text tables and small multiples. In some instances, alternative views are generated by trying all good choices for color, shape, and size encodings. In some instances, alternative views are generated as view type variants (e.g., filled maps vs. symbol maps; bar charts that are stacked, horizontal, or vertical; etc.).

The disclosed ranking techniques can be applied regardless of how the possible data visualizations are identified. In addition, some implementations use some ranking techniques in the generation phase (e.g., using a subset of the techniques that can be applied quickly to reduce the number of alternative data visualizations that proceed to the full evaluation phase). Some ranking systems implement a “progressive” or “hierarchical” process with multiple passes to triage the data visualization options piecemeal. In a progressive ranking process, a very high percentage of the options are eliminated in a first level cull based on simple criteria that can be applied quickly. Each subsequent culling uses more detailed information to identify the options that will progress to the next level. Some implementations have several progressive culling steps before the complete ranking is applied to a small subset of the originally identified options. In a progressive process, some implementations compute partial ranking of data at each level, and retain the partial ranking information for use on subsequent levels.
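A progressive culling process of this kind might be organized as in the sketch below. The function name `progressive_rank`, the representation of each stage as a `(score_fn, keep_fraction)` pair, and the use of a dictionary to retain partial scores are assumptions for illustration only.

```python
def progressive_rank(options, stages, final_score):
    """Cull options in successive passes, then fully rank the survivors.

    stages: list of (score_fn, keep_fraction) pairs, cheapest criteria first.
    Each score_fn receives the option and the partial scores retained from
    earlier levels, so later levels can reuse earlier work.
    """
    survivors = [(opt, {}) for opt in options]
    for level, (score_fn, keep_fraction) in enumerate(stages):
        for opt, partial in survivors:
            partial[level] = score_fn(opt, partial)
        survivors.sort(key=lambda pair: pair[1][level], reverse=True)
        keep = max(1, int(len(survivors) * keep_fraction))
        survivors = survivors[:keep]
    ranked = sorted(survivors, key=lambda pair: final_score(pair[0], pair[1]), reverse=True)
    return [opt for opt, _ in ranked]
```

A small `keep_fraction` in the first stage corresponds to eliminating a very high percentage of the options using simple criteria.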

Disclosed ranking methods evaluate the collection of views based on the sets of data fields selected (either the set of data fields selected by the user, or a modified set of data fields, such as a reduced or expanded set). The views are scored based on a combination of factors. The factors include appropriateness to the data types. For example, if the set of data includes a geographic component, then a map view of the data is weighted more highly. The factors also include the visual structure presented by the view. For example, when there are multiple possible scatter plot views of the data, the one with a visual pattern such as clustering or correlation is weighted more highly. Techniques to identify visual patterns are described in more detail below, including in regard to FIGS. 9A-9C, 11A, and 11B. The factors also include the aesthetics of the visual layout. For example, data visualizations that fit entirely within the display or avoid overlapping labels are preferred. This is described in more detail below, including with regard to FIGS. 7A, 7B, 8A, and 8B. In addition, the factors include similarity to the user's previously created data visualizations. For example, relevant considerations include what types of data visualizations the user has selected, in what contexts those visualizations were selected, what types of encodings (such as color, size, or shape) the user prefers, and so on. The factors also include relevant user preferences, and in some implementations the aggregated preferences of one or more groups (e.g., the group of people working in the finance department in an organization, or the group of all users).

In some implementations, the ranking proceeds as a single step. In other implementations, each possible data visualization is first ranked within its view type (e.g., for the view type “bar chart,” all of the bar charts are ranked against each other, whereas all scatter plot diagrams would be ranked against each other within the “scatter plot” view type). The views within each view type are ranked using criteria based on the properties of the view type, the selected data fields, and user properties (e.g., user history, user preferences, or aggregated history of multiple users). Finally, the system combines the ranked lists of view instances of different view types, applying criteria about the relative value of chart types for the data types in the user-selected set. For example, if the user-selected set of data fields includes a temporal field along with a quantitative field, a line chart is probably more useful than a text table view. A line chart is better at visualizing trends, clusters, and anomalies over time. In some implementations, the views that exhibit best practices, together with a diversity of view types, are placed at the top of the combined list.

The identified (or “generated”) data views are scored in the evaluation phase using a variety of weighted criteria. One skilled in the art recognizes that the weighting of criteria can change over time based on feedback from users (explicit or implicit), the addition of new criteria, and so on. Further, the criteria identified herein are not intended to be exhaustive, and one of skill in the art recognizes that other similar criteria may be used. The criteria for evaluating identified data visualizations include statistical properties in the data that can be seen as visual patterns in the view (e.g., clumping, outliers, correlation, or monotonic graphs). The criteria for evaluating data visualizations also include aesthetic properties of the visual layout of the view. Of course, only quantifiable aesthetic qualities are included in the evaluation process (e.g., aspect ratio). In addition, other user-specific criteria may be used. For example, a user may indicate a preference for certain types of encoding (e.g., a CFO may prefer to use specific color encodings for each of the company's four sales regions). In addition, if a user has previously worked with the same (or a similar) data set, the history of the previous data visualizations may indicate preferences. Prior usage of the same or similar data set is particularly relevant when the user selects some of the same data fields from the data set.

Disclosed ranking methods combine a number of ranking criteria based on aspects unique to each data visualization type. Some ranking systems implement a separate scoring function for each view type, with the scoring function tailored to the particular data characteristics that are visible. Below are five examples of view types and some simple use cases for each of these view types. Based on these examples, sample scoring functions are described that capture important aspects of the visualizations.

There are also some criteria that are generally applicable across all (or almost all) view types. Large charts are ineffective for visual data analysis when they require scroll bars to fit on a display device. Some implementations partially address this problem using automatic scaling, but scaling has limits (e.g., the text that is displayed cannot get too small). When only a portion of a visualization is visible, it takes longer for a user to search and find points of interest, to make visual comparisons, or to answer questions. Indeed, without a complete view, some of the benefits of a data visualization are lost. In addition, accuracy suffers because the user has to keep track of virtual reference points during scrolling actions that shift the viewport of analysis. Therefore, views that are larger than the canvas size are penalized. Some implementations also distinguish between horizontal scroll bars and vertical scroll bars when they are necessary. Scrolling vertically is more comfortable for many users than scrolling horizontally, so some implementations penalize vertical scroll bars less than horizontal scroll bars.
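One possible form of such a size penalty is sketched below. The function name `scroll_penalty`, the use of the relative overflow as the measure, and the specific weights are hypothetical; the only property taken from the description is that horizontal overflow is penalized more than vertical overflow.

```python
def scroll_penalty(view_w, view_h, canvas_w, canvas_h,
                   horizontal_weight=2.0, vertical_weight=1.0):
    """Return a non-negative penalty; 0.0 when the view fits on the canvas."""
    # Fraction by which the view exceeds the canvas in each direction.
    over_w = max(0.0, view_w / canvas_w - 1.0)
    over_h = max(0.0, view_h / canvas_h - 1.0)
    # Horizontal scrolling is weighted more heavily than vertical scrolling.
    return horizontal_weight * over_w + vertical_weight * over_h
```

A view that is twice the canvas width is penalized more than a view that is twice the canvas height under these assumed weights.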

Also, when a user has created a view explicitly, selecting a particular view type or encoding of certain data fields, the ranking process favors views that closely adhere to the user's original selections. For example, if the user has already selected a view type, then the selected view type has a preferential ranking. In addition, when the user has selected some visual encodings (e.g., color is used to represent different sales regions), there is a preference to retain those encodings.

Text Tables

Text tables are commonly used to view numeric values as text with high levels of precision. Two kinds of text tables are commonly constructed. One kind of text table displays details of each record or item on a single row. This is standard practice for accounting purposes and is the format used in typical spreadsheet programs. Each of the data dimensions is placed in a column, resulting in a table whose length is based on the number of items in the data set and whose width is based on the number of dimensions in the data set. Within that format, the only variation is how the dimensions are ordered.

A second kind of text table is a crosstab, which summarizes categorical data by displaying the frequency distribution of the categories. A crosstab can be created by a pivot operation in most spreadsheet programs. The categorical dimensions define the X-positions and Y-positions within a two-dimensional matrix. The intersection of row and column categorical values forms a cell that represents a summary for that combination of categorical values.

Certain observations pertain to both kinds of text tables and help identify ranking criteria for text tables. First, tables of textual data should facilitate reading at several levels. At the elementary level, text tables enable quick comprehension of numeric values displayed as visual marks. At the intermediate level, text tables enable perception of regularity and patterns in the data. At the global level, text tables enable grasping the whole visual representation. This facilitation of reading occurs when certain columns are colocated. For example, placing columns with similar data types (dates, text, numbers) together facilitates reading. Similarly, placing functionally dependent data dimensions (e.g., hierarchies) next to each other facilitates reading. In addition, placing semantically related columns together (e.g., sales and profit; ship date and order date) facilitates reading. Therefore, some ranking methods for text tables score text table views according to these rules. Implementations that cull or limit the set of possible data visualizations select the text tables that best adhere to these rules.

Tables of text can be visually scanned quickly for patterns of strings such as increasing or similar length strings across rows. Therefore, some ranking criteria take this into account. Implementations that cull or limit the set of possible data visualizations may order the quantitative dimensions by placing similar (e.g., correlated) dimensions next to each other to facilitate the visual comprehension of such quantitative data relationships.

Crosstabs that have fewer items per pane are generally better than crosstabs that have a large number of items in each pane because the smaller number of items facilitates comparison across panes. Empirical evidence indicates that people are better at retaining (and comparing) chunks of approximately five data elements. Therefore, a categorical data field with a cardinality of about 5 is preferred at the innermost nesting level in a text table. Implementations that cull or limit the set of possible data visualizations may order the categorical data fields, placing a category with cardinality close to five as the innermost level of the text table.
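The preference for an innermost field with cardinality close to five can be expressed as a simple ordering rule. In the sketch below, the function name `order_for_nesting` is hypothetical, and ordering the remaining (outer) fields by their distance from the target is an assumption; the description specifies only the innermost position.

```python
def order_for_nesting(cardinalities, target=5):
    """Order field names from outermost to innermost nesting level.

    cardinalities: dict mapping field name to its number of distinct values.
    The field whose cardinality is closest to the target is placed last,
    which corresponds to the innermost level.
    """
    return sorted(cardinalities,
                  key=lambda field: abs(cardinalities[field] - target),
                  reverse=True)
```

For example, given fields with cardinalities 50, 3, and 6, the field with cardinality 6 is placed at the innermost level.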

Finally, text tables that grow vertically are easier for human understanding because they align with most traditional web, document, and table presentations. Scoring functions give a higher rank to text tables with a vertical aspect ratio than text tables with a horizontal aspect ratio. As noted earlier, text tables that can be built completely on a display screen without scroll bars are ranked even higher (although it is not always possible to avoid vertical scroll bars).

Bar Charts

Bar charts are commonly used for visual data representations. Bar charts are useful because people are good at making length comparisons and locating a position along a common scale.

Two of the criteria identified above for text tables apply to bar charts as well. Similar (correlated) quantitative dimensions are preferably colocated because it is visually easy to detect patterns of similar length bars. Also, the ordering of categorical dimensions favors placing a category with cardinality close to five as the innermost level of a bar chart.

Sorted bars visually highlight overall trends (e.g., long-tailed distributions) and draw attention to outliers (e.g., very large or very small values) when a quantitative data field is represented by bar length. In some cases, the categorical dimension representing the bars is of greater interest for look-up purposes, so sorting the bars (e.g., alphabetically) provides a more effective representation. Because these two sorting methods (by bar length or by a categorical dimension) each have different advantages, user preferences or prior data visualizations may affect the ranking. For example, other users of the same data fields may have shown a preference for one or the other sorting method.

Horizontal bar lengths can be compared easily across quantitative dimensions that are arranged vertically. The converse is true when looking at vertical bars. Some scoring functions prefer a vertical aspect ratio when horizontal bars are drawn and a horizontal aspect ratio when vertical bars are drawn.

Scatter Plots

In many cases, bivariate distributions are visually best represented as two-dimensional point clouds, commonly referred to as scatter plots. A scatter plot illustrates the relationship between the two quantitative dimensions plotted against each other on the x and y axes.

Shapes in point clouds often correspond to interesting statistical properties in the data. A two-dimensional scatter plot of uniform random noise is the baseline case depicting no pattern at all. Scoring functions look for various interesting shapes in the scatter plots, such as clumps (clusters of points), monotonicity (positive or negative correlation), striation (presence of a variable taking on discrete values, such as integers), or outliers. Identifying shapes or structure within scatter plots is described in greater detail below. The presence of any such shapes in a scatter plot increases the score of the scatter plot. Some implementations use formulas or methods described in “Graph-Theoretic Scagnostics,” L. Wilkinson et al., Proceedings of the IEEE Information Visualization 2005, pages 157-164.
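As one illustration of a shape measure, monotonicity can be estimated with a squared rank correlation, which is near 1 for a strongly increasing or decreasing relationship and near 0 for no monotonic trend. The sketch below is an assumption about how such a measure might be computed; it is not asserted to be the formula used by any particular implementation, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5

def monotonicity(xs, ys):
    """Squared rank correlation of the plotted points, in the range [0, 1]."""
    return pearson(average_ranks(xs), average_ranks(ys)) ** 2
```

Squaring the correlation treats positive and negative monotonic relationships alike, consistent with scoring either direction as an interesting pattern.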

Scatter plots are meaningful when they contain more than a single point per pane. In particular, views with fewer than five points per pane are generally ineffective. Therefore, ineffective views are scored much lower, resulting in early culling. In implementations that generate only “good” views from the outset, such ineffective views are excluded.

Scatter plots have a different aspect ratio preference from other visual charts. In particular, roughly square aspect ratios are favorable for perceiving correlations between variables in scatter plots. Like other view types, scatter plot views that have no scroll bars are preferred.

Line Graphs

Line graphs (also called “line charts”) are commonly used to represent quantitative data against a temporal variable. Line charts with only flat horizontal lines are the baseline cases that depict a lack of pattern. Thus, the rank of a line graph is based on showing some variability or trend. Examples include peaks or troughs in the trend lines, clusters of lines with similar trends, or outlier trend lines. Some implementations identify repeating patterns of peaks and/or troughs. Scoring functions quantify the amount of variability and extent of a trend.

Line charts with too many lines that intersect, overlap, or are too closely spaced are harder to read. On the other hand, line charts with only a few lines more effectively display patterns and trends. Therefore, scoring functions rank more highly those views with fewer lines per pane. For example, when the lines correspond to a categorical data field, the score is related to the cardinality of the data field. In some implementations, a cardinality of 5 receives the highest score. Some implementations also measure the extent to which the lines cross each other or are spaced apart (e.g., even three lines can produce a poor data visualization if the lines are close together and crisscross each other frequently). FIGS. 12A and 12B below illustrate some of these features of line graphs.

Maps

Symbol maps are generally preferred over filled maps because people are better able to perceive size variation than color differences. In some implementations, a scoring function for maps ranks small multiples of filled maps in the same way as pie charts on maps. Both options reveal structure in the data for different analytical tasks, so in the absence of knowledge about the user's task, both types are useful. In some implementations, the pie charts have a small number of splitting categories. In particular, when the cardinality of the category forming the basis for the pie chart is large, the pie-map view is not as useful.

In addition, map views with vertical aspect ratios and views that do not have scroll bars are preferred. In some implementations, scoring functions look at the data distribution to determine how well particular visual encodings work for the selected data fields. Size is the most restrictive encoding. Encoding data based on size is roughly equivalent to applying a square root transform and representing the result. If the transform results in uniformly distributed data, then it is generally not a good measure to encode with size. Also, since the size is proportional to the data value, it is preferable to encode data with a range closer to zero for size encoding because it results in a bigger range of sizes. In some implementations, a numeric range for a measure is transformed (e.g., using a linear transformation) to make size encoding more useful.

Size encoding is generally not appropriate when a numeric field can take on negative values. For example, if a numeric field represents a company's monthly profit, there would be a problem if the company lost money during some months. In some instances, however, negative values can be avoided by a transformation, such as converting temperature readings on the Celsius scale to the Kelvin scale.
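The check for negative values, and the temperature transformation mentioned as an example, can be sketched as follows. The function names are hypothetical and chosen only for illustration.

```python
def size_encodable(values):
    """Return True when no value is negative, so size encoding is usable."""
    return all(v >= 0 for v in values)

def celsius_to_kelvin(values):
    """Shift Celsius readings onto the Kelvin scale, which has no negative values."""
    return [v + 273.15 for v in values]
```

Readings such as -5 and 10 degrees Celsius fail the check directly but pass it after conversion to Kelvin. No comparable transformation is meaningful for a field such as monthly profit, where the sign itself carries information.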

Color is a very flexible encoding method because it can represent measures regardless of range, including ranges that straddle zero. Color encoding may not be particularly useful for highly skewed data because few values are represented by the highest intensity and all the other values are flattened to the lower intensities (or vice versa). On the other hand, such an encoding may draw attention to outliers in the data, which may be of interest to the user. Previous feedback from the user (or a cohort of users) may indicate whether such an encoding is desirable or not. Color can also represent categorical variables with small cardinality. In some implementations, color encoding for categorical variables with a cardinality of ten or less is considered good (i.e., ranked high), but the scoring decreases as the cardinality increases beyond ten. When there are too many colors, they become difficult to discriminate.

Shape is perceptually hard to discern when there are more than ten distinct shapes plotted in a view. However, when the shapes are distinctive or there is a small number of them, shape can be an effective way of communicating additional information.

The ranking criteria identified above for text tables, bar charts, scatter plots, line graphs, and maps are not exhaustive, and are expected to vary over time as further empirical data is collected about what types of data visualizations are useful. In addition, implementations apply similar criteria to other types of data visualizations, such as treemaps, network diagrams, bubble plots, and so on. Further, the weighting of the criteria varies based on user preferences, feedback from individual users, and aggregated feedback.
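A scoring function for categorical color encoding that is flat up to a cardinality of ten and decreases beyond ten might look like the following sketch. The linear falloff and the `falloff` parameter are assumptions; the description states only that the score decreases as the cardinality increases beyond ten.

```python
def color_cardinality_score(cardinality, good_up_to=10, falloff=10):
    """Score in [0, 1] for color-encoding a categorical field.

    Cardinalities up to good_up_to receive the full score; beyond that the
    score drops linearly (an assumed shape), reaching 0 after falloff more
    distinct values.
    """
    if cardinality <= good_up_to:
        return 1.0
    return max(0.0, 1.0 - (cardinality - good_up_to) / falloff)
```

Under these assumed parameters, a field with 15 distinct values scores 0.5 and a field with 20 or more distinct values scores 0.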

In some implementations, the scores within each view type are combined to form a single overall ranking. In some implementations, merging the ranked lists of views of different types involves a number of different considerations that are combined. The considerations include favoring map views when the set of data fields contains a geographic field and not more than two measures. In general, maps can encode a maximum of two measures, one measure corresponding to the size of the geographically positioned symbols and one measure corresponding to the color of those symbols. Line charts are favored when the set of data fields contains a temporal field. A line chart naturally represents the continuity of time, making it easier to see trends, consistent patterns, and outlying behavior. Bar charts are favored over scatter plots when more than two measures are selected because it is easier to see the overall trend of multiple measures aligned together and make relative comparisons on the values across the measures. A scatter plot is favored when exactly two measures are selected along with any number of other fields, because it is generally the best visual representation to understand the bivariate data relationship between the two measures. Large views are almost always disfavored, including large text tables with a large number of empty cells or large bar charts that require scrolling on the height and width for exploration. Also disfavored are small multiples of maps or scatter plots in which each pane is small, which makes the whole display difficult to read.
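The view type preferences listed above can be collected into a small rule function. In this sketch the function name `favored_view_types` and the field type labels (`"geographic"`, `"temporal"`, `"measure"`, `"categorical"`) are hypothetical, and how strongly a favored type is boosted during merging is left open.

```python
def favored_view_types(field_types):
    """Return the set of view types favored for the given field types.

    field_types: list of labels, one per selected data field.
    """
    measures = field_types.count("measure")
    favored = set()
    # Maps can encode at most two measures (symbol size and symbol color).
    if "geographic" in field_types and measures <= 2:
        favored.add("map")
    # A temporal field favors line charts.
    if "temporal" in field_types:
        favored.add("line")
    # More than two measures favors bar charts over scatter plots.
    if measures > 2:
        favored.add("bar")
    # Exactly two measures favors a scatter plot.
    if measures == 2:
        favored.add("scatter")
    return favored
```

More than one view type can be favored for the same selection, for example a geographic field with exactly two measures favors both a map and a scatter plot.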

In some implementations, in addition to the views that use exactly the set of data fields selected by the user, additional alternative views are identified based on modified sets of data fields. In some implementations, the set of alternative views is presented to the user separately. Within the set of alternative views, the ranking has an additional factor, which is the extent to which the modified set of data fields differs from the original user-selected set of data fields. The greater the differences, the lower the weight, regardless of how good the data visualization is (even a “great” data visualization is not useful if it is not what the user wants).
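One way to model this additional factor is a weight that shrinks with each field added to or removed from the user-selected set. The exponential form and the `decay` parameter below are assumptions made for illustration; the description requires only that greater differences produce a lower weight.

```python
def alternative_weight(selected, modified, decay=0.5):
    """Weight in (0, 1] that decreases as the field sets diverge.

    The difference is counted as the number of fields present in exactly
    one of the two sets (additions plus removals).
    """
    difference = len(set(selected) ^ set(modified))
    return decay ** difference
```

Multiplying a view's score by this weight lowers the rank of views whose field sets depart further from the user's selection.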

In some implementations, all of the views are ranked together and presented to the user in a single list. In this case, merging the two lists has some additional factors. In general, there is a preference for the best views that include the exact set of data fields selected by the user. Large views are down-weighted. This includes large tables, complex views, or large groups of small multiples, even if the large views include the exact set of user-selected data fields. Large or complex views that require scroll bars for navigation or represent a large set of data fields sacrifice analytic value for the sake of representing all the data. In some instances, different views of subsets of the data are more meaningful (e.g., applying a filter). Some implementations favor views that use a subset of the data fields when the number of user-selected data fields exceeds some threshold. Conversely, some implementations favor views with a superset of the user-selected data fields when the number of user-selected data fields is less than some threshold.

FIG. 1 illustrates the context in which some implementations operate. A user 100 interacts with a computing device 102, such as a desktop computer, a laptop computer, a tablet computer, a mobile computing device, or a virtual machine running on such a device. An example computing device 102 is described below with respect to FIG. 2, including various software programs or modules that execute on the device 102. In some implementations, the computing device 102 includes one or more data sources 236 and a data visualization application 222 that the user 100 uses to create data visualizations from the data sources. That is, some implementations can provide data visualization to a user without connecting to external data sources or programs over a network.

However, in some cases, the computing device 102 connects over one or more communications networks 108 to external databases 106 and/or a data visualization server 104. The communication networks 108 may include local area networks and/or wide area networks, such as the Internet. A data visualization server 104 is described in more detail with respect to FIG. 3. In particular, some implementations provide a data visualization web application 320 that runs wholly or partially within a web browser 220 on the computing device 102. In some implementations, data visualization functionality is provided by both a local application 222 and certain functions provided by the server 104. For example, the server 104 may be used for resource intensive operations.

FIG. 2 is a block diagram illustrating a computing device 102 that a user uses to create and display data visualizations in accordance with some implementations. A computing device 102 typically includes one or more processing units/cores (CPUs/GPUs) 202 for executing modules, programs, and/or instructions stored in memory 214 and thereby performing processing operations; one or more network or other communications interfaces 204; memory 214; and one or more communication buses 212 for interconnecting these components. The communication buses 212 may include circuitry that interconnects and controls communications between system components. A computing device 102 includes a user interface 206 comprising a display device 208 and one or more input devices or mechanisms 210. In some implementations, the input device/mechanism 210 includes a keyboard; in some implementations, the input device/mechanism includes a “soft” keyboard, which is displayed as needed on the display device 208, enabling a user to “press keys” that appear on the display 208. In some implementations, the display 208 and input device/mechanism 210 comprise a touch screen display (also called a touch sensitive display). In some implementations, memory 214 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices. In some implementations, memory 214 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. Optionally, memory 214 includes one or more storage devices remotely located from the CPU(s)/GPUs 202. Memory 214, or alternatively the non-volatile memory device(s) within memory 214, comprises a computer readable storage medium. In some implementations, memory 214, or the computer readable storage medium of memory 214, stores the following programs, modules, and data structures, or a subset thereof:

-   an operating system 216, which includes procedures for handling various basic system services and for performing hardware dependent tasks;
-   a communications module 218, which is used for connecting the computing device 102 to other computers and devices via the one or more communication network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
-   a web browser 220 (or other client application), which enables a user 100 to communicate over a network with remote computers or devices. In some implementations, the web browser 220 executes a data visualization web application 320 provided by a data visualization server 104 (e.g., by receiving appropriate web pages from the server 104 as needed). In some implementations, a data visualization web application 320 is an alternative to storing a data visualization application 222 locally;
-   a data visualization application 222, which enables users to construct data visualizations from various data sources. The data visualization application 222 retrieves data from a data source 236, then generates and displays the retrieved information in one or more data visualizations. In some instances, the data visualization application invokes other modules (either on the computing device 102 or at a data visualization server 104) to identify a set of good data visualizations based on the user's selection of data fields, as described in more detail below;
-   the data visualization application 222 includes a data visualization identification module 224, which uses a set of data fields selected by the user, and identifies or generates a set of possible data visualizations based on the set of selected fields;
-   the data visualization application 222 includes a ranking module 226, which takes a set of possible data visualizations for a set of data fields, and ranks the possible data visualizations according to a set of ranking criteria 228. This process is described in more detail below;
-   in some implementations, the data visualization application 222 stores user preferences 230, which may be used by the identification module 224, the ranking module 226, or for other aspects of the data visualization application 222. The user preferences may include preferences that are explicitly stated and/or preferences that are inferred based on prior usage. The preferences may specify what types of data visualizations are preferred, the preferred data visualization types based on the data types of the selected data fields, preferences for visual encodings (such as size, shape, or color), weighting factors for the various ranking criteria (e.g., inferred by prior selections), and so on. Some implementations also provide for group preferences, such as preferences for a financial group or preferences for a marketing or sales group. Some implementations also identify the aggregate preferences of all users (“the wisdom of the herd”). Some implementations allow both individual and group preferences. Some implementations enable multiple levels of user preferences. For example, a user may specify general preferences as well as preferences for a specific data source or specific fields within a data source. For example, a user may have a specific preferred set of shape, size, or color encodings for the product lines within a company;
-   in some implementations, the data visualization application 222 stores data in a history log 232 for each data visualization created by the user 100. In some implementations the history log 232 is used to directly or indirectly identify future data visualizations for the user and/or for other users. In some implementations, a history log 232 is stored at a server 104 in addition to or instead of a history log 232 stored on the computing device 102. An example history log 232 is illustrated in FIG. 14;
-   in some implementations, the ranking module 226 stores data in a ranking log 234 for each data visualization option evaluated for a user. In some implementations the ranking log 234 is used to evaluate and adapt the ranking process in order to provide each user with better options based on previous selections. An example ranking log 234 is illustrated in FIG. 15; and
-   one or more data sources 236, which have data that may be used and displayed by the data visualization application 222. Data sources 236 can be formatted in many different ways, such as spreadsheets, XML files, flat files, CSV files, text files, desktop database files, or relational databases. Typically the data sources 236 are used by other applications as well (e.g., a spreadsheet application).

Each of the above identified executable modules, applications, or sets of procedures may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, memory 214 may store a subset of the modules and data structures identified above. Furthermore, memory 214 may store additional modules or data structures not described above.

Although FIG. 2 shows a computing device 102, FIG. 2 is intended more as a functional description of the various features that may be present rather than as a structural schematic of the implementations described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated.

FIG. 3 is a block diagram illustrating a data visualization server 104, in accordance with some implementations. A data visualization server 104 may host one or more databases 106 or may provide various executable applications or modules. A server 104 typically includes one or more processing units (CPUs/GPUs) 302, one or more network interfaces 304, memory 314, and one or more communication buses 312 for interconnecting these components. In some implementations, the server 104 includes a user interface 306, which includes a display device 308 and one or more input devices 310, such as a keyboard and a mouse.

Memory 314 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices, and may include non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. Memory 314 may optionally include one or more storage devices remotely located from the CPU(s)/GPUs 302. Memory 314, or alternately the non-volatile memory device(s) within memory 314, includes a non-transitory computer readable storage medium. In some implementations, memory 314 or the computer readable storage medium of memory 314 stores the following programs, modules, and data structures, or a subset thereof:

-   an operating system 316, which includes procedures for handling various basic system services and for performing hardware dependent tasks;
-   a network communication module 318, which is used for connecting the server 104 to other computers via the one or more communication network interfaces 304 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
-   a data visualization web application 320, which may be downloaded and executed by a web browser 220 on a user's computing device 102. In general, a data visualization web application 320 has the same functionality as a desktop data visualization application 222, but provides the flexibility of access from any device at any location with network connectivity, and does not require installation and maintenance;
-   a data visualization identification module 224, which may be invoked by either the data visualization application 222 or the data visualization web application 320. The identification module was described above with respect to FIG. 2, and is described in more detail below;
-   a ranking module 226, which may be invoked by either the data visualization application 222 or the data visualization web application 320. The ranking module was described above with respect to FIG. 2, and is described in more detail below;
-   an analytic module 322, which analyzes the data visualization history log 232 (either for a single user or multiple users). In some implementations, the analytic module 322 infers user preferences 230 based on the data in the history log (e.g., what types of data visualizations the user prefers, what visual encodings the user prefers, and so on). In some implementations, the analytic module uses history log data 232 from multiple users to infer aggregate preferences 324. In some instances, the aggregate preferences are for a well-defined group of individuals, such as the employees in a company's finance department. In some instances, the aggregate preferences pertain to specific data fields from a specific data source 236 (e.g., encode certain data fields in a specific way). In some instances, the analytic module 322 identifies aggregate preferences 324 on a more global level, such as a preference to use a map data visualization when the selected data fields include a geographic location. In some instances, the analytic module 322 identifies preferences based on the data types of the data fields (e.g., if two numeric fields, one date field, and one categorical field are selected, what types of data visualizations are preferred). In some implementations, machine learning (e.g., a neural network) is used to infer global preferences;
-   one or more databases 106, which store data sources 236 and other information used by the data visualization application 222 or data visualization web application 320;
-   in some implementations, the database(s) 106 store the ranking criteria 228 that are used by the ranking module 226. Examples of ranking criteria 228 and how they are applied and combined are described in more detail herein. In some implementations, the ranking criteria 228 and/or the weighting of the ranking criteria is updated over time by the analytic module 322 as additional data about actual usage is collected and analyzed;
-   in some implementations, the database(s) 106 store user preferences 230, which were described in more detail above with respect to FIG. 2;
-   the database(s) 106 store a history log 232, which specifies the data visualizations actually selected by users. Each history log entry includes a user identifier, a timestamp of when the data visualization was created, a list of the data fields used in the data visualization, the type of the data visualization (sometimes referred to as a “view type” or a “chart type”), and how each of the data fields was used in the data visualization. In some implementations, an image and/or a thumbnail image of the data visualization is also stored. Some implementations store additional information about created data visualizations, such as the name and location of the data source, the number of rows from the data source that were included in the data visualization, version of the data visualization software, and so on. For security and/or data privacy reasons, some implementations modify, limit, and/or encrypt certain data before storage in the log 232 (e.g., some implementations anonymize the data). A history log 232 is illustrated below in FIG. 14;
-   in some implementations, the ranking module 226 stores data in a ranking log 234 for each data visualization option evaluated for a user. In some implementations the ranking log 234 is used to evaluate and adapt the ranking process in order to provide each user with better options based on previous selections. An example ranking log 234 is illustrated in FIG. 15; and
-   in some implementations, the database(s) 106 store aggregate preferences 324, which are inferred by the analytic module 322, as described above.
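As an illustration only, the required contents of a history log entry described above could be modeled as a small record type. This is a minimal sketch, not the actual layout of the history log 232 (which is shown in FIG. 14); the class name `HistoryLogEntry`, its attribute names, and the sample values are assumptions introduced here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class HistoryLogEntry:
    """Hypothetical record of one data visualization a user created."""
    user_id: str           # user identifier (may be anonymized before storage)
    created_at: datetime   # timestamp of when the visualization was created
    view_type: str         # the "view type" or "chart type", e.g. "bar"
    # Maps each data field used to how it was used (e.g. "rows", "color").
    field_usage: Dict[str, str] = field(default_factory=dict)

    @property
    def data_fields(self) -> List[str]:
        """The list of data fields used in the visualization."""
        return list(self.field_usage)


# Sample values are invented for illustration.
entry = HistoryLogEntry(
    user_id="u-001",
    created_at=datetime(2014, 4, 1, 9, 30),
    view_type="bar",
    field_usage={"Loan Sector": "rows", "Loan Status": "rows", "Amount": "columns"},
)
```

Optional items mentioned above, such as a thumbnail image or the data source name, could be added as further optional attributes.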

Each of the above identified executable modules, applications, or sets of procedures may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, memory 314 may store a subset of the modules and data structures identified above. Furthermore, memory 314 may store additional modules or data structures not described above.

Although FIG. 3 shows a server 104, FIG. 3 is intended more as a functional description of the various features that may be present rather than as a structural schematic of the implementations described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated. In addition, some of the programs, functions, procedures, or data shown above with respect to a server 104 may be stored on a computing device 102. In some implementations, the functionality and/or data may be allocated between a computing device 102 and one or more servers 104. Furthermore, one of skill in the art recognizes that FIG. 3 need not represent a single physical device. In many implementations, the server functionality is allocated across multiple physical devices that comprise a server system. As used herein, references to a “server” or “data visualization server” include various groups, collections, or arrays of servers that provide the described functionality, and the physical servers need not be physically colocated (e.g., the individual physical devices could be spread throughout the United States or throughout the world).

FIG. 4 illustrates a process flow for identifying and ranking data visualizations in accordance with some implementations. In this example, the data source 236 as well as the user preferences 230, history log 232, and aggregate preferences 324 are stored in a database 106, which may be accessed over a network 108 or stored locally on a computing device 102 of the user 100. The user 100 selects (420) a set of data fields 402 from the data source(s) 236. The user wants to create a data visualization that includes these fields.

In some implementations, the data visualization identification module 224 takes the selected set of data fields 402, and identifies (422) alternative modified sets of data fields 404. The modified sets include supersets of the selected fields 402, subsets of the selected fields, sets of fields in which different filters are applied, sets in which one or more fields is replaced by another field (such as a hierarchically broader or narrower field), and so on. In some instances, when supersets or subsets are selected, the selection is based on semantic relatedness of the fields. For example, a superset may include an additional field that is related to the other fields. In another example, a field may be removed because it is not semantically related to the other fields. In practice, the alternative sets of data fields 404 are typically closely related to the original set of data fields 402 selected by the user because the goal is to create data visualizations that display what the user wants. This process is described in more detail below with respect to FIGS. 6A and 6B.

For each set of data fields, the data visualization identification module 224 identifies (424) possible data visualizations 406 to display the data fields in the set. In some implementations, all possibilities are identified. In some implementations, all possibilities are initially identified, but many are culled based on simple evaluation criteria. This avoids applying the full evaluation process to a large number of possible data visualizations, which is generally useful because many of the options can be quickly dismissed as not being as good as other options. In some implementations, the identification module 224 operates multiple threads in parallel. For example, some implementations use a separate thread for each of the basic view types. In some implementations, the identification process is further subdivided in order to identify all the options more quickly. In some implementations, the parallel processing uses map-reduce technology, and may be combined with the ranking phase.

The ranking module 226 ranks (426) the identified data visualizations 406 to form a ranked list 408. In some implementations, the ranked list 408 includes only a small number of top ranked entries (e.g., the top five or ten recommended data visualizations). In some implementations, the ranking module 226 ranks all of the possible data visualizations 406 after all of the options have been identified. In some implementations, the ranking module 226 ranks each data visualization as it is identified. In particular, when the identification process 424 operates in parallel, the ranking process 426 operates in parallel as well. In some implementations, the scores used for ranking comprise two scores: a first score based on comparing data visualizations within a specified view type, and a second score based on the view type itself. In these implementations, the first score represents how well the proposed data visualization stacks up against other visualizations of the same type (taking into account the specific data fields selected). The second score represents how well a certain view type is able to represent the selected fields (e.g., a map generally represents data well when there is a geographic component).

For the final rankings, all of the data is used (subject to any applied filters). However, in earlier stages of the process, some implementations compute a preliminary ranking based on a subset of the data (i.e., less than all of the rows from the data source). For a very large data source, a preliminary ranking may be based on a small subset of the rows, such as 1% or 5%. Some implementations use a random sample or other sampling technique.
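The row sampling used for a preliminary ranking can be sketched as follows. This is a minimal illustration that assumes simple uniform random sampling over an in-memory list of rows; the function name `sample_rows` and its parameters are hypothetical, and an actual implementation may use a different sampling technique.

```python
import random


def sample_rows(rows, fraction=0.05, seed=0):
    """Return a uniform random subset of rows for a preliminary ranking pass.

    `fraction` is the share of rows to keep (e.g. 0.01 or 0.05); at least one
    row is returned whenever `rows` is non-empty.
    """
    if not rows:
        return []
    k = max(1, round(len(rows) * fraction))
    # A seeded generator keeps the preliminary ranking reproducible.
    return random.Random(seed).sample(rows, k)


rows = list(range(1000))            # stand-in for rows from a data source
preview = sample_rows(rows, 0.05)   # 5% sample used for the preliminary ranking
```

The final ranking would then be recomputed over all rows, subject to any applied filters.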

As described herein, various criteria may be used to compute the scores, and each criterion may be assigned a distinct weight in the overall scoring process. In some implementations, the weighting is linear, such as s = w₁c₁ + w₂c₂ + . . . + w_(n)c_(n), where s is the overall score, c₁, c₂, . . . , c_(n) are the criteria, and w₁, w₂, . . . , w_(n) are the weights for the corresponding criteria. In some implementations, the weights are adjusted over time based on actual user selection of data visualizations. In some implementations, the weights are adjusted or adapted to individual user preferences or the preferences of a cohort group of users. In some implementations, the weighting of the criteria is non-linear. Each criterion may be based on several factors, such as the values of multiple data fields. In some implementations, some criteria apply to all of the possible data visualizations 406, whereas other criteria are applicable to only data visualizations of certain view types. This is also described with respect to FIG. 5.
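The linear weighting described above can be sketched in a few lines. This is an illustrative sketch only: the criterion names and weight values are invented, and a criterion that does not apply to a given view type is assumed to have weight zero.

```python
def overall_score(criteria, weights):
    """Linear combination s = w1*c1 + w2*c2 + ... + wn*cn.

    `criteria` maps a criterion name to its value c_i; `weights` maps the
    same names to w_i. A criterion with no weight contributes nothing.
    """
    return sum(weights.get(name, 0.0) * value for name, value in criteria.items())


# Hypothetical criterion values for one candidate data visualization.
criteria = {"fits_display": 1.0, "space_used": 0.5}
weights = {"fits_display": 2.0, "space_used": 4.0}
score = overall_score(criteria, weights)
```

Adapting the ranking to a user or cohort then amounts to supplying a different `weights` mapping.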

Once the data visualizations are ranked (426), the ranked data visualizations are presented (428) to the user. A sample presentation is illustrated in FIG. 13. Some implementations limit the number of data visualizations presented (428) to the user 100. In some implementations, the number presented is a user configurable parameter. In some implementations, the presentation screen includes a button or other visual control to see additional options. For example, in some implementations, the top five data visualizations are presented to the user. If the user wants to see additional options, the user may select the “More” button to see the data visualizations ranked 6-10. Pressing the button additional times displays further options that were ranked even lower.
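The “More” button behavior amounts to paging through the ranked list. A minimal sketch, assuming the ranked list is held in memory best-first and that the page size is the user configurable parameter; the function name `page_of_results` is hypothetical.

```python
def page_of_results(ranked, page, page_size=5):
    """Return the slice of the ranked list shown for a zero-based page."""
    start = page * page_size
    return ranked[start:start + page_size]


ranked = [f"viz-{i}" for i in range(1, 13)]  # placeholder names, best first
first = page_of_results(ranked, 0)   # initially presented: ranks 1-5
more = page_of_results(ranked, 1)    # after one press of "More": ranks 6-10
```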

FIG. 5 illustrates a process where the data visualizations are identified and evaluated for each view type separately, then merged together at the end. Some implementations use map-reduce technology for this process to reduce the overall time. However, the processing for each view type can occur serially (e.g., when there are insufficient resources for parallel processing). In this illustration, the process starts with a single set of data fields 402, but the same processes could be applied to multiple alternative sets of data fields 404 simultaneously. For example, some implementations assign a distinct execution thread to each (view type, data field set) combination, and perform a merge at the end. In other implementations, a thread is assigned to each view type, and within that view type all of the alternative sets of data fields 404 are considered together (e.g., serially).

Within a data visualization application 222 (or web application 320), there is a fixed set of supported view types 502. (Of course a new version of the software may support additional view types.) In FIG. 5, there are n view types, labeled as view types 502-1, 502-2, 502-3, . . . , 502-n, where n is a positive integer. In typical implementations, n is an integer between five and ten. Within each of these view types, the identification module 224 identifies (424) a set of data visualizations with that view type. In this illustration there are n distinct view types, so there are n distinct identification processes, each running an instance of the identification module 224 (i.e., processes 424-1, 424-2, 424-3, . . . , 424-n). In some implementations, the identification module 224 comprises a set of programs, procedures, or methods, with a distinct program (or procedure or method) for each of the view types. In some implementations, the identification phase is top down: identify all options, then cull the ones that can be easily recognized as not good. Other implementations use a bottom up approach, generating only the options that are considered sufficiently good.

Once the possible data visualizations within a view type are identified, the ranking module 226 ranks (426) them against each other. Some implementations use a scoring function, and the data visualizations with the highest scores are ranked the highest. Because each view type has specific advantages and disadvantages, the ranking module typically has a distinct scoring function for each of the view types. As noted with respect to FIG. 4, a scoring function is based on a set of weighted criteria. Some of the criteria are shared across multiple view types, but even when criteria are shared, they may be weighted differently for different view types. For example, the presence or absence of scroll bars is a criterion that generally applies to all view types, but for text tables there is a greater tolerance for vertical scroll bars. In addition, sometimes user preferences or user history affect the weighting of criteria. For example, a user who is very comfortable with large spreadsheets may be less bothered by horizontal scroll bars in a data visualization, and thus the criterion to downgrade data visualizations with horizontal scroll bars may be weighted less or eliminated entirely. Some examples of the criteria the ranking module 226 uses are illustrated below in FIGS. 7A, 7B, 8A, 8B, 9A-9C, 10A, 10B, 11A, 11B, 12A, and 12B. In some implementations, the ranking process 426 culls all options with scores below a certain threshold level (which may be different for different view types).

Depending on the selected data fields 402, different types of data visualization are empirically better or worse at conveying the information from those data fields. Therefore, the overall score for a data visualization includes a portion that is based just on the view type. In some implementations, the scoring based on view type is included in the ranking process 426 for each view type, and thus the merge process 504 entails sorting all of the data visualizations based on their overall scores. In other implementations, the scores for view type are accounted for in the merge process, which is sometimes non-linear (e.g., more complex than just adding a fixed number to each score based on the view type of each data visualization). Furthermore, the merging process may occur after the scoring within each view type (as illustrated), or as a continuous process. For example, if all of the threads are executing on a single physical device, some implementations maintain the single ranked list 408 in memory or other data storage at that device. However, in a map-reduce implementation that uses multiple distinct physical devices, implementations typically store individual ranked lists locally for each view type and merge 504 at the end.
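The rank-per-view-type-then-merge flow of FIG. 5 can be sketched on a single machine as follows. This is a simplified illustration, not the described map-reduce implementation: it assumes one worker thread per view type, a caller-supplied scoring function that already includes the view type portion of the score, and a single culling threshold. All function and parameter names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def rank_within_view_type(view_type, candidates, score_fn, threshold=0.0):
    """Score one view type's candidates, cull low scores, sort best first."""
    scored = [(score_fn(view_type, c), view_type, c) for c in candidates]
    kept = [s for s in scored if s[0] >= threshold]
    return sorted(kept, key=lambda s: s[0], reverse=True)


def merge_ranked(candidates_by_type, score_fn, top_k=5):
    """Rank each view type on its own thread, then merge into one ranked list."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(rank_within_view_type, view_type, candidates, score_fn)
            for view_type, candidates in candidates_by_type.items()
        ]
        per_type = [f.result() for f in futures]
    # Each per-type list is already sorted best first, so a k-way merge suffices.
    merged = heapq.merge(*per_type, key=lambda s: s[0], reverse=True)
    return list(merged)[:top_k]
```

Because each per-type list is already sorted, the merge step does not need to re-sort the combined candidates.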

In implementations that include alternative modified sets of data fields 404, there can be additional merging. In some implementations, all of the data visualizations are considered together, and the views with highest overall rank are displayed to the user in a single ranked list 408. In some implementations, these additional data visualizations are identified (424) and ranked (426) together with the data visualizations based on the exact set of data fields 402 selected by the user. The alternatives are downgraded according to the extent of modification (e.g., having one criterion that measures the amount of modification from the base set 402, and including this criterion in each scoring function). In other implementations, these alternatives are processed on separate threads, and merged together (504) at the end, with downgraded scores based on altering the set of user-selected data fields. The ranked list 408 of recommendations is presented (428) to the user.

In other implementations, the identified possible data visualizations that use exactly the set of data fields selected by the user are displayed 428 in one list (e.g., one window), and a second list displays the top ranked data visualizations where the set of data fields has been modified in at least one way.

FIGS. 6A and 6B illustrate ways in which a user selected set of data fields 402 can be modified to form an alternative set of data fields. Because the user has specifically selected a set of data fields 402, most implementations limit the modifications (e.g., replacing the selected set of fields with a different set of fields would be a “modification,” but would not represent what the user is seeking).

FIG. 6A identifies a set of fields that are included in various sets of fields in FIG. 6B. Field F1 602 is a simple ordinal field, which is typically a character field with a small set of distinct values. For example, F1 may represent sales regions or product lines. The notation [f] after a field name indicates that the filter f is applied to the field. For example, F1[f_(a)] 604 indicates that the field F1 has been limited by filter f_(a). In practice, filters can involve a combination of fields or apply to an aggregate value, but in FIGS. 6A and 6B the examples are limited to filters that apply to non-aggregated single fields. The field F1[f_(b)] 606 is the field F1 limited by filter f_(b). For example, if F1 is a field that represents product lines, filters f_(a) and f_(b) could limit the set of product lines (e.g., product lines in the U.S. or product lines for paper products).

Fields F2 608 and F3 612 are quantitative fields which can take on a continuous range of numeric values (limited by the precision of the data type). Field F2[g] 610 is the field F2 limited by the filter g. Field F4 614 is a date field, such as an order date. Field F4[h] 616 is the field F4 limited by the filter h. For example, if F4 is an order date field, the filter h may limit the data to orders in 2015. F4[h].Q 618 and F4[h].M 620 indicate the same date field F4 limited by the filter h, but converted to a quarter or month. For example, if F4[h] is an order date field limited to dates in 2015, then F4[h].Q specifies the quarter for each order date (e.g., one of the values 1, 2, 3, or 4). For F4[h].M, the data is converted to a month (e.g., a number between 1 and 12 or the name of the corresponding month). F4.Y 622 is similar, but does not apply a filter and converts the date data to a year. Finally, F5 624 is another data field of any type.
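The filtered and converted date fields can be illustrated with a short sketch. It assumes calendar quarters and uses the 2015 filter from the example above; the helper names `quarter_of` and `in_2015` and the sample dates are hypothetical.

```python
from datetime import date


def quarter_of(d):
    """Map a date to its calendar quarter (1 to 4), as in a .Q conversion."""
    return (d.month - 1) // 3 + 1


def in_2015(d):
    """Example filter h: keep only dates in 2015."""
    return d.year == 2015


order_dates = [date(2014, 12, 31), date(2015, 2, 14), date(2015, 8, 1)]
filtered = [d for d in order_dates if in_2015(d)]   # F4[h]
quarters = [quarter_of(d) for d in filtered]        # F4[h].Q
months = [d.month for d in filtered]                # F4[h].M
years = [d.year for d in order_dates]               # F4.Y (no filter applied)
```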

In FIG. 6B, the user selected data fields 402 are F1[f_(a)], F2, F3, and F4[h].Q. The identification module 224 identifies (422) alternative sets of data fields 404 that are similar to the set of data fields selected by the user. Thirteen sample sets are illustrated, including the set {F1[f_(a)], F2, F3, F4[h].Q} 642 selected by the user. The set {F1[f_(a)], F2, F3, F4[h].Q, F5} 644 is a superset, including the additional field F5 624. The set {F1[f_(a)], F2, F4[h].Q} 646 is a subset, with the field F3 612 removed.

The set {F1[f_(a)], F2, F3, F4[h].Q, F4[h].M} 648 is also a superset, but with a specific structure. The set 648 includes both F4[h].Q and F4[h].M, providing both the quarter and the month corresponding to the date field F4. The set {F1[f_(a)], F2, F3, F4[h].M} 650 is similar to the original set 642, but has replaced the quarter with the month. This set of data fields would display the same data, but at a finer level of granularity. The set {F1[f_(a)], F2, F3, F4.Y} 652 is also similar to the original set 642, but has replaced the quarter with the year. In this example set 652, the filter h has also been removed. A data visualization with this set of fields would display the data at a coarser level of granularity (by year rather than by quarter).

The set {F1[f_(b)], F2, F3, F4[h].Q} 654 is the same set of fields as the original set 642, but with a different filter f_(b) applied to the field F1. Depending on f_(a) and f_(b), data visualizations using the two different filters may display more data, less data, or just different portions of the data. The set {F1[f_(a)], F2[g], F3, F4[h].Q} 656 has the same set of fields as the original set 642, but has added a filter g for the field F2. The set {F1, F2, F3, F4[h].Q} 658 has the same set of fields as the original set 642, but has removed the filter f_(a) from the field F1. The set {F1, F2[g], F3, F4[h].Q} 660 has the same set of fields as the previous example set 658, but has added the filter g for the field F2.

Each of the last three example sets has two or more changes from the original set 642. The set {F1, F2, F3, F4[h].Q, F5} 662 has added the field F5 and removed the filter f_(a) from field F1. The set {F1[f_(b)], F3, F4[h].Q} 664 has removed the field F2 and switched from filter f_(a) to filter f_(b) for field F1. Finally, the set {F1[f_(b)], F3, F4[h].Q, F5} 666 has removed the field F2, added the field F5, and switched from filter f_(a) to filter f_(b) for field F1. Because it makes three changes to the set of data fields, the set 666 would be downgraded substantially.

The various example sets in FIG. 6B illustrate some of the ways that a set of data fields may be modified to create alternative data visualizations. Some implementations downgrade the ultimate rankings differently depending on the type of modification and what the set was originally. For example, if a user has selected many data fields, adding additional fields would be heavily downgraded, whereas removing fields to form a subset may be downgraded only slightly. Conversely, if the user has selected only a small number of fields, then adding more fields may be useful, particularly if the added fields are semantically related to the selected fields. Implementations typically limit the number of modifications that will be considered, both because of the deviation from what the user has requested as well as the high cost of generating and evaluating many more options. In some implementations, the limit is two modifications.
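Enumerating alternative sets of data fields under a modification limit can be sketched as follows. This simplified illustration covers only adding and removing fields (not changing filters or replacing a field with a broader or narrower one), and the flat per-modification penalty is an assumption; as noted above, the downgrade may depend on the type of modification and on the original set. All names are hypothetical.

```python
from itertools import combinations


def alternative_field_sets(selected, extra_fields, max_mods=2):
    """Yield (field_set, n_mods) for every set within max_mods add/remove changes."""
    selected = frozenset(selected)
    # Each single modification either removes a selected field or adds a new one.
    mods = [("remove", f) for f in sorted(selected)]
    mods += [("add", f) for f in sorted(set(extra_fields) - selected)]
    yield selected, 0  # the user's original selection, unmodified
    for n in range(1, max_mods + 1):
        for combo in combinations(mods, n):
            fields = set(selected)
            for op, f in combo:
                if op == "remove":
                    fields.discard(f)
                else:
                    fields.add(f)
            if fields:  # skip the empty set
                yield frozenset(fields), n


def downgraded(score, n_mods, penalty=0.1):
    """Lower a score by a flat, assumed penalty per modification."""
    return score - penalty * n_mods


alternatives = list(alternative_field_sets({"F1", "F2", "F3"}, ["F5"]))
```

With three selected fields and one candidate extra field, the limit of two modifications yields the original set plus ten alternatives.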

FIGS. 7A and 7B illustrate the preference for data visualizations that fit entirely within the display. FIG. 7A is a text table with a poor aspect ratio 700. The table is sparsely populated and requires a horizontal scroll bar 702 in order to see all of the data. In contrast, the text table in FIG. 7B has a good aspect ratio 704, which fits entirely within the display. It has a denser display, which is generally not problematic for a text table. Even if FIG. 7B required a vertical scroll bar (not pictured), it would be preferable to the horizontal scroll bar 702 in FIG. 7A.

FIGS. 8A and 8B illustrate two alternative bar graphs and some criteria for evaluating them. In FIGS. 8A and 8B, the rows are defined by the pair of fields Loan Status and Loan Sector, but the order of these two fields is different. In FIG. 8A, the Loan Status 802 is the outermost field and the Loan Sector 804 is the innermost field. With this arrangement, some of the panes have a large number of rows, such as the first pane 806 with 15 rows for different loan sectors. In FIG. 8B, with the Loan Sector 818 as the outermost field and the Loan Status 820 as the innermost field, each pane has four or five rows, as indicated by the identified panes 822, 824, 826, and 828. Visually a user can readily grasp and remember the data in a pane with four or five rows, but trying to grasp and remember fifteen rows in the single pane 806 is not easy. Empirical evidence shows that a data visualization with panes having about five elements is better for users, so one criterion for bar graphs is to score the potential bar graphs based on the number of rows in the innermost level of nesting. See, e.g., “The Magical Number Seven, Plus or Minus Two: Some Limits on our Capacity for Processing Information,” George A. Miller, The Psychological Review, 1956, vol. 63, pp. 81-97.
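One way to turn the rows-per-pane observation into a numeric criterion is sketched below. The specific formula (mean relative deviation from an ideal of five rows, mapped into the range 0 to 1) is an assumption made for illustration, as are the function name and the sample pane sizes.

```python
def pane_size_score(rows_per_pane, ideal=5):
    """Score in (0, 1]; equals 1.0 when every innermost pane has `ideal` rows."""
    if not rows_per_pane:
        return 0.0
    # Relative deviation of each pane from the ideal row count.
    penalties = [abs(n - ideal) / ideal for n in rows_per_pane]
    return 1.0 / (1.0 + sum(penalties) / len(penalties))


uneven = pane_size_score([15, 3, 2])      # one pane with fifteen rows
balanced = pane_size_score([4, 5, 5, 4])  # panes with four or five rows each
```

Under this formula an arrangement with four or five rows in every pane scores higher than one that includes a fifteen-row pane.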

In addition, the bar graph in FIG. 8A fails to use the horizontal space. The longest bar is only as long as the measuring line 808, leaving a substantial amount of white space in the graph. On the other hand, the bar graph in FIG. 8B uses the full extent of the available horizontal space as indicated by the measuring line 834. Some implementations include criteria that measure the extent to which data visualizations use the available space.
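A space-usage criterion for bar graphs could be as simple as the fraction of the horizontal axis covered by the longest bar. This is a hypothetical sketch; the function name, the use of a common length unit such as pixels, and the clamping to 1.0 are assumptions.

```python
def horizontal_space_used(bar_lengths, axis_length):
    """Fraction (0.0 to 1.0) of the horizontal axis covered by the longest bar."""
    if axis_length <= 0 or not bar_lengths:
        return 0.0
    return min(1.0, max(bar_lengths) / axis_length)


sparse = horizontal_space_used([40, 120, 80], axis_length=400)  # much white space
full = horizontal_space_used([150, 400, 320], axis_length=400)  # uses full width
```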

The examples in FIGS. 8A and 8B include vertical scroll bars 810 and 836. Because both graphs include scroll bars, the scroll bar criterion does not change the relative ranking of the data visualizations in these figures. An alternative bar graph that does not include vertical scroll bars might be scored even higher than the bar graph in FIG. 8B.

FIGS. 9A, 9B, and 9C are scatter plots that compare three measurablecharacteristics of cars: price, the compression ratio of the engine, andthe horsepower of the engine. If a user selected all three of these datafields, which would be the best scatter plot to recommend? A quickanswer is probably FIG. 9C because it appears to show the greatestcorrelation between variables. FIG. 9A shows the least correlation. Ifonly one of these could be selected, then using FIG. 9C would show thecorrelation, and the compression ratio could be encoded in the marks(e.g., by the size of the marks).

In some implementations, when there are multiple similar options such asthese, a combined data visualization may be created. In fact, such acombined data visualization could be more useful than any oneindividually because it seems to show that price is somewhat correlatedto horsepower (FIG. 9C), but price is not very correlated withcompression ratio.

FIGS. 10A and 10B illustrate two different maps that show some numeric variable for each of the states in the United States. FIG. 10A is sometimes referred to as a symbol map and FIG. 10B is sometimes referred to as a filled map. In the map of FIG. 10A, the numeric variable is encoded as the size of the circle displayed in each state. It is relatively easy to see that the circle 1004 in Illinois is large, the circle 1008 in Texas is fairly large, the circle 1010 in South Carolina is small, and the circle 1006 in Nevada is very small. But what about Montana 1002, where there does not appear to be a circle at all? The numeric variable is actually negative for Montana, so there is no straightforward way for a circle with a positive size to represent a negative value.

FIG. 10B provides a map where each state is filled with a color based onthe same numeric variable used in FIG. 10A. Unlike size, colors can beused effectively to display any ranges of numbers, including negativevalues. In the original color version of FIG. 10B, Montana 1022 iscolored with a pink shade, whereas all of the other states with positivevalues are colored with some shade of green, making it very easy torecognize the outlier. In this black & white rendering, a line patternhas been added for Montana. (Some implementations use fill patterns whencolor is not available.)

Although color facilitates rendering negative values, the color fill may not be as visually clear when there is no inherent correlation between color and the magnitude of a numeric variable. Here, a user 100 who is familiar with the color encoding can recognize that Illinois 1024 has the highest value, that Texas 1026 has a large value, that South Carolina 1030 has a smaller value, and that Nevada 1026 has a very small value. In this example, the score for the visualization in FIG. 10B is higher than the score for the visualization in FIG. 10A because of the ability to encode negative values. However, if the numeric variable were always positive (e.g., population), then FIG. 10A might have a higher score.

FIGS. 11A and 11B show scatter plot diagrams. In FIG. 11A, there is nodiscernible pattern (e.g., no clustering, outliers, striation, ormonotonicity), so it would receive a low score. On the other hand, FIG.11B illustrates two statistical features. First, there is an outlier1102, which is highly visible in this view. (Of course it would be up toan analyst to determine whether the outlier is due to an importantconsideration, a fluke, or a problem with the data.) FIG. 11B alsoincludes a clump or cluster 1104, which is a group of points that areclose to each other but distant from other points in the scatter plot.Because of the outlier 1102 and the cluster 1104, the data visualizationin FIG. 11B would be scored more highly than the data visualization inFIG. 11A. In some implementations, the data visualization would scoreeven higher if there were multiple clusters. Techniques to identifyclumps, outliers, and other features in scatter plots are described inmore detail below.

For scatter plots, implementations consider other graphic features aswell. For example, some implementations consider whether the plottedpoints show a monotonic trend, whether the plotted points show acorrelation between the data fields on the axes (e.g., linear,quadratic, or exponential), and whether the plotted points take ondiscrete values for either data field (e.g., the y-values are allapproximately integer multiples of a base value b).

FIGS. 12A and 12B illustrate two line graphs of data for three regions.Typically, line graphs are appropriate when one of the data fields istemporal (e.g., a date, a time of day, or the number of millisecondsafter a starting time in a scientific experiment). In FIG. 12A, the line1212 for the western region 1202 initially increases, stays about thesame, then decreases substantially. The line 1214 for the central region1204 jumps up and back down for each time interval. Finally, for theeastern region 1206, the line 1216 slowly goes down, but then goes backup. None of the lines 1212, 1214, or 1216 has a consistent trend, andthere is no consistency between the lines for the three regions. Theline graph in FIG. 12A would therefore have a low score.

On the other hand, the line chart in FIG. 12B has at least two visiblefeatures. First, the lines 1232, 1234, and 1236 for each of the regions1222, 1224, and 1226 are monotonically increasing. Second, the lines1232, 1234, and 1236 are trending in approximately the same way as eachother. This correlation between the lines is a useful feature. For thesereasons, the line graph in FIG. 12B would be scored more highly than theline graph in FIG. 12A.

One skilled in the art recognizes that monotonicity can be evaluated in various ways. For example, some implementations use Spearman's rank correlation coefficient to measure monotonicity. The raw data (X₁, Y₁), (X₂, Y₂), . . . , (Xₙ, Yₙ) is converted to two sets of ranks {x₁, x₂, . . . , xₙ} and {y₁, y₂, . . . , yₙ}, where the ranks are the integers 1, 2, . . . , n. x₁ is the rank of X₁, x₂ is the rank of X₂, and so on. If x̄ is the mean of the ranks x₁, x₂, . . . , xₙ, and ȳ is the mean of the ranks y₁, y₂, . . . , yₙ, then the Spearman rank correlation coefficient ρ is given by the formula:

$\text{MonotonicityMeasure} = \rho = \frac{\sum_{i}\left(x_{i} - \bar{x}\right)\left(y_{i} - \bar{y}\right)}{\sqrt{\sum_{i}\left(x_{i} - \bar{x}\right)^{2}\,\sum_{i}\left(y_{i} - \bar{y}\right)^{2}}}$

where the index i ranges from 1 to n in each sum. Some implementations take the absolute value of this calculation so that monotonically decreasing relations have a positive value for the monotonicity measure.
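The Spearman-based measure can be sketched in a few lines of Python. The function names are illustrative, and the handling of tied values (tied values share the average of their ranks) is an assumption, because the description above assumes distinct integer ranks.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def monotonicity_measure(xs, ys):
    """Absolute value of Spearman's rho computed on the ranks of xs and ys."""
    n = len(xs)
    if n == 0:
        return 0.0
    rx, ry = ranks(xs), ranks(ys)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    denominator = sqrt(
        sum((a - mean_x) ** 2 for a in rx) * sum((b - mean_y) ** 2 for b in ry)
    )
    # A zero denominator means one series is constant; treat it as not monotonic.
    return abs(numerator / denominator) if denominator else 0.0
```

Because the measure works on ranks rather than raw values, any strictly increasing or strictly decreasing series yields 1.0 regardless of how unevenly the values are spaced.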

To compute monotonicity, some implementations compare the total numberof consecutive pairs of points where the y-coordinate of the secondpoint is either greater than the y-coordinate of the first point, equalto the y-coordinate of the first point, or less than the y-coordinate ofthe first point.

In some implementations, monotonicity values at or close to 1 are theonly ones considered interesting, so smaller values are set to zero. Forexample, if the computed MonotonicityMeasure is less than 0.75, then setit to zero. The monotonicity measures for all of the lines in a linegraph can be combined in various ways, such as summing, averaging, ortaking the maximum.
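The pair-counting approach, the 0.75 cutoff, and the combination step can be sketched as follows. This is one possible reading: the function names, the choice to divide the dominant direction by the total number of steps, and the treatment of equal consecutive values as neither increasing nor decreasing are all assumptions.

```python
def pairwise_monotonicity(ys):
    """Fraction of consecutive steps that move in the dominant direction."""
    up = sum(1 for a, b in zip(ys, ys[1:]) if b > a)
    down = sum(1 for a, b in zip(ys, ys[1:]) if b < a)
    steps = len(ys) - 1
    return max(up, down) / steps if steps > 0 else 0.0


def thresholded(measure, cutoff=0.75):
    """Keep only measures at or above the cutoff; smaller values are set to zero."""
    return measure if measure >= cutoff else 0.0


def combine(measures, how="mean"):
    """Combine the per-line measures of a line graph by summing, averaging, or taking the maximum."""
    kept = [thresholded(m) for m in measures]
    if not kept:
        return 0.0
    if how == "sum":
        return sum(kept)
    if how == "max":
        return max(kept)
    return sum(kept) / len(kept)
```

For a line graph with three lines, `combine([pairwise_monotonicity(ys) for ys in lines], how="max")` would reward the graph if at least one line is nearly monotonic, whereas `how="mean"` rewards graphs in which most lines are.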

Even when lines in a graph are not monotonic, it can be useful toidentify when two or more of the lines within the graph have similarshapes by having consistent trends. For example, two lines may generallygo up and down together, such as stock prices for multiple stocks in thesame sector.

Some implementations compute the trending consistency between two linesin a way similar to computing monotonicity. For example, if (x₁, y₁) and(x₂ , y₂) are two consecutive points on a first line, and (x₁, y′₁) and(x₂ , y′₂) are corresponding consecutive points on a second line, thenthe two lines are trending in the same way between x₁ and x₂ when

$\frac{y_{2} - y_{1}}{y_{2}^{\prime} - y_{1}^{\prime}} > 0$

By counting the number of consecutive points where the two lines aretrending in the same way versus trending in opposite directions, thetrending consistency can be measured like monotonicity, as illustratedabove. When there are too many lines and/or too many points, thecomputational cost of comparing all the lines may be too high. Trendingconsistency may be particularly interesting when there are several lineswith the same consistency, as illustrated in FIG. 12B.
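A minimal sketch of this count follows. It tests the sign of the product of the two changes rather than the sign of the ratio, which is equivalent whenever the ratio is defined and avoids division by zero when the second line is flat. Skipping flat segments and the function name are assumptions.

```python
def trending_consistency(ys_a, ys_b):
    """Fraction of segments on which two lines sharing x-values move in the same direction."""
    same = opposite = 0
    segments_a = zip(ys_a, ys_a[1:])
    segments_b = zip(ys_b, ys_b[1:])
    for (a1, a2), (b1, b2) in zip(segments_a, segments_b):
        product = (a2 - a1) * (b2 - b1)
        if product > 0:
            same += 1
        elif product < 0:
            opposite += 1
        # product == 0: at least one line is flat here; the segment is skipped.
    total = same + opposite
    return same / total if total else 0.0
```

Comparing every pair of lines requires on the order of L² calls for L lines, which is why the cost can become prohibitive when there are many lines or many points.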

FIG. 13 shows an example presentation of the ranked list 408 of topranked data visualizations. Some implementations include the rank 1302in the display. However, some implementations omit the rank fieldbecause the recommended data visualizations are displayed in rank order.Some implementations include a preview 1304 for each of the datavisualizations. In some implementations, the previews are thumbnailimages of the actual data visualizations. In some implementations, thepresentation includes a view type column 1306, which specifies the viewtype for each of the recommended options.

In some implementations, the presentation includes a description column1308, which provides additional notes about each of the recommended datavisualizations. For each presented option, the description 1310 mayspecify which data fields specify the X-positions of graphical marks,which data fields specify the Y-positions of graphical marks, whichfields are used for color, shape, or size encodings, which filters areapplied, and so on. The description 1310 may also specify anymodifications to the set of data fields 402 (e.g., data fields that wereadded or removed).

FIG. 14 illustrates a data visualization history log 232, which tracksdata visualizations selected by one or more users. The datavisualizations in the log 232 can be constructed entirely by the user,constructed by an automated process and selected by the user, or ahybrid construction (e.g., initially generated automatically andsubsequently modified by the user).

When a log 232 supports more than a single user, the log 232 typicallyincludes a user ID 1402 that uniquely identifies the user. In someimplementations, the user ID 1402 is an email address, a network ID, ora user selected ID that is used by the data visualization application222 or web application 320. In some implementations, the date ordate/time 1404 of the user selection is tracked in the log 232.

For each data visualization selected, the log 232 tracks details aboutthe visual specification 1406, which includes various parameters of thedata visualization. The visual specification identifies the list offields 1408 that are included in the data visualization. Some of thefields are data fields taken directly from a data source 236, but otherfields are computed based on one or more data fields. For example, ayear or quarter field may be computed from a date field representing anorder date. Implementations typically group data visualizations into asmall number of distinct view types, such as text tables, bar charts,line charts, maps, and scatter plots. The view type 1410 of a datavisualization is stored in the log 232. In some implementations, some ofthe basic view types have some variations that are classified assubtypes. For these implementations, the subtype is typically stored inthe log 232 as well.

Data visualizations are typically based on a Cartesian layout with rowsand columns. One or more of the fields in the field list 1408 areincluded in the X-position fields 1412 and one or more of the fields inthe field list 1408 are included in the Y-position fields 1414. Theorder of the fields within the X-position fields 1412 and within theY-position fields 1414 is important because the order specifies thehierarchical structure. This was illustrated above with respect to FIGS.7A, 7B, 8A, and 8B. In some instances, the data from the data source 236is aggregated. For aggregated data, the level of detail 1416 specifiesthe grouping. The fields in the level of detail 1416 are similar to theGROUP BY fields in an SQL query.

In some instances, a data visualization uses one or more filters 1418, which are stored in the log 232. The filters limit the rows from the data source 236 that are selected for visualization. For example, transaction data may be filtered to a specific date range. Filters are similar to WHERE clauses in an SQL query.

Data visualizations can use various types of encodings to communicateadditional information. For some view types (e.g., a line chart), afield can be used to specify path encoding 1420, which orders the datain the display according to the path encoding field 1420. For example,consider a line chart that correlates revenue and profit, with revenueused to specify the x-position. By default, the line graph orders thedata from lowest to highest revenue. However, a person might prefer tosee the same data sorted by date, which can be accomplished by using theappropriate date field for path encoding.

A label encoding 1422 specifies labels that are associated withgraphical marks in the data visualizations. A color encoding can assigna color to each graphical mark based on the value in an encoding field.The color encoding 1424 is saved in the log 232. Finally, the size ofvisual marks can be set according to a quantitative field designated forsize encoding. The size encoding 1426 is stored in the log 232. Each ofthe encoding types 1420, 1422, 1424, and 1426 may use a single field,but none is required. In some instances, two or more of the encodingoptions are used for a single data visualization.
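The parts of a visual specification described above can be pictured as a simple record. The following sketch is illustrative only: the class name, attribute names, and types are assumptions, and the reference numerals in the comments point back to the items described for FIG. 14.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VisualSpecification:
    """Hypothetical record mirroring the visual specification 1406."""

    fields: List[str]                                             # field list 1408
    view_type: str                                                # view type 1410
    x_position_fields: List[str] = field(default_factory=list)   # 1412, order matters
    y_position_fields: List[str] = field(default_factory=list)   # 1414, order matters
    level_of_detail: List[str] = field(default_factory=list)     # 1416, like GROUP BY
    filters: List[str] = field(default_factory=list)             # 1418, like WHERE
    path_encoding: Optional[str] = None                           # 1420
    label_encoding: Optional[str] = None                          # 1422
    color_encoding: Optional[str] = None                          # 1424
    size_encoding: Optional[str] = None                           # 1426


# The revenue/profit line chart from the text, ordered by date via path encoding.
spec = VisualSpecification(
    fields=["Revenue", "Profit", "Order Date"],
    view_type="line chart",
    x_position_fields=["Revenue"],
    y_position_fields=["Profit"],
    path_encoding="Order Date",
)
```

Each encoding defaults to `None` because none of the encodings is required.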

In some implementations, when data visualization options are generatedand presented to a user, each of the options has an associated uniqueidentifier 1512, as illustrated in FIG. 15 below. In some of theseimplementations, when a user selects one of those options, the datavisualization option ID 1512 is stored in the history log 232, and actsas a link between the history log 232 (what the user selected) and theranking log 234 (what was presented to the user).

Some implementations store additional information about each datavisualization selected by a user. Some implementations store anidentifier of the data source 236, which may be expressed in variousways depending on the data source type. For example, a spreadsheet maybe specified by a full network path name, and possibly an indicator of aspecific sheet name or number within the spreadsheet. For an SQLdatabase, the data source may be specified by a set of parameters,including the server, database, and a table or view. Someimplementations provide for data blending from two or more data sources,so the log entry for a data source 236 may be a more complex expression.

Some implementations store an image 1428 of the data visualization,which may be a full resolution image, a thumbnail image, or othercompressed image, and may be stored in varying formats (e.g., JPEG,TIFF, PNG, PDF). Some implementations track the software version 1430that was active at the time the data visualization was created. This maybe useful later to identify software bugs, to track changes in thesoftware over time, for statistical analysis of software usage, and soon.

Some implementations store additional pieces of data, which may be usedlater to analyze and improve the ranking process for the individual useror analyze and improve the software. In some implementations, thisincludes the count 1432 of rows that were selected from the data source.Some implementations track the amount of time required to perform theoperations (e.g., the amount of time to retrieve the data).

In addition to the history log 232 of data visualizations actually selected by the user, some implementations include a data visualization ranking log 234, as illustrated in FIG. 15, which tracks the data visualization options that were generated and presented to the user. When the ranking log 234 supports multiple distinct users, the ranking log 234 typically includes a user ID 1502 that specifies the user for whom the options were generated. In addition, a date or date/time entry 1504 stores when the options were generated. Some implementations also store the amount of time used to generate the options, how many processors were used, and other generation parameters.

Data visualization options are generated based on one or moreuser-selected fields 1506 and zero or more user-selected filters 1508.The generation and ranking process creates one or more datavisualization options 1510 that use the user-selected fields 1506 anduser-selected filters 1508 (although some of the data visualizationoptions may modify the set of fields and/or the set of filters). In someimplementations, each data visualization option has an assigned uniquedata visualization option ID 1512. Each data visualization option has anassociated rank 1514, which is stored in the ranking log 234. Note thatthe rank 1514 is the computed rank at the time the option is presentedto the user. If the same data visualization option is presented to theuser in a subsequent ranking process, the rank may be different, even ifbased on the same user-selected fields 1506 and same user-selectedfilters 1508. For example, as more feedback is collected from the user,the weighting of the ranking criteria may be adjusted, or the user mayspecify explicit changes to user preferences.

Some implementations store partial scores 1516 and associated weights 1518, as well as other intermediate calculations 1520 that were used by the ranking process. Examples of partial scores 1516 and intermediate calculations 1520 are provided below, including DataScore, LayoutScore, SimilarityScore, VisualChunking, Sparsity, AspectRatio, ScrollPenalty, PearsonsCorrelation, ClumpyMeasure, StriationMeasure, OutlyingMeasure, MonotonicityMeasure, and VariabilityScore. This data can be used to improve the ranking process in the future. For example, alternative weights can be tested to identify rankings that more closely match what the user actually selected. By having this raw data, various machine learning algorithms can be applied.

Some implementations store whether each data visualization option wasselected by the user 1522. In some implementations, selection by theuser is indicated by the history log 232, using the data visualizationoption ID 1512. Some implementations use both ways to show which datavisualization options have been selected by the user.

Each data visualization option has a visual specification 1524, which isanalogous to the visual specification 1406 described above for thehistory log 232. In particular, the field list 1526, the view type 1528,X-position fields 1530, Y-position fields 1532, level of detail fields1534, filters 1536, path encoding 1538, label encoding 1540, colorencoding 1542, and size encoding 1544 have the same meanings ascorresponding named entries in the history log 232, which were describedabove.

FIGS. 16A and 16B illustrate how columns in a data visualization may be rearranged to convey information better. In this example, the raw data comes from the FAA, and represents wildlife strikes (typically birds) by airplanes at or near airports (see http://wildlife.faa.gov/). The data is grouped by the amount of damage to the plane (None, Minor, Medium, Substantial, or Destroyed). Within these groupings, four different quantitative data fields are evaluated. The first data field is the total cost for each strike, which is displayed in the Cost Total $ pane 1602. A second data field is the number of airplanes damaged, which is shown in the Number Damaged pane 1604. The Number of Strikes pane 1606 shows the total number of wildlife strikes in each of the five groupings. Finally, the Number of People Injured pane 1608 shows the total number of people who were injured as a result of the wildlife strikes.

As seen in the Number of Strikes pane 1606, the majority of strikesresult in no damage. The number of strikes that result in a destroyedplane is so small that it does not even register on the bar graph.

When displaying multiple measures side-by-side as in FIGS. 16A and 16B,a user may better comprehend and retain the information when correlateddata fields are placed next to each other. In FIG. 16A, pane 1606 doesnot correlate well with either of the panes 1604 or 1608, and pane 1604does not correlate well with pane 1602. FIG. 16B illustrates anarrangement that has greater total correlation between adjacentmeasures. In particular, pane 1608 correlates fairly well with pane1602, and the pane 1606 that does not correlate with any of the otherthree data fields is placed on the far right so that it is adjacent toonly one other pane.

Some implementations measure correlation between quantitative fieldsusing Pearson's correlation. For example, if Q₁, Q₂, Q₃, and Q₄ are thequantitative fields corresponding to panes 1602, 1604, 1606, and 1608,then the total correlation for the data visualization in FIG. 16A is|corr(Q₁, Q₂)|+|corr(Q₂, Q₃)|+|corr(Q₃, Q₄)|. In FIG. 16B, the totalcorrelation is |corr(Q₁, Q₄)|+|corr(Q₄, Q₂)|+|corr(Q₂, Q₃)|. In thissample formula, the absolute value is used so that negatively correlatedquantitative data fields add to the overall correlation.
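The total-correlation formula can be sketched as below, together with a brute-force search over pane orders. The exhaustive search is an assumption made for illustration (it is practical only for a handful of panes); the text above specifies the sum of absolute adjacent correlations but not how an ordering is chosen.

```python
from itertools import permutations
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return numerator / denominator if denominator else 0.0


def total_adjacent_correlation(columns):
    """Sum of |corr| over each pair of side-by-side columns."""
    return sum(abs(pearson(a, b)) for a, b in zip(columns, columns[1:]))


def best_order(measures):
    """Return the ordering of measure names with the greatest total adjacent correlation.

    `measures` maps a measure name to its list of values, one value per grouping.
    """
    return max(
        permutations(measures),
        key=lambda order: total_adjacent_correlation([measures[m] for m in order]),
    )
```

With four measures there are only 24 orderings to test, so the brute-force search is inexpensive for a display like FIGS. 16A and 16B.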

FIGS. 17A-17C, 18A-18D, and 19A-19D illustrate various aspects ofprocesses that implementations use to generate and rank datavisualization options. The aspects illustrated in these three flowcharts may be combined in various ways.

FIGS. 17A-17C provide a flowchart of a process 1700, performed (1704) at a computing device 102, for ranking data visualizations (1702) in accordance with some implementations. The computing device 102 has (1704) one or more processors and memory, and the memory stores (1706) one or more programs for execution by the one or more processors. In this flowchart, solid rectangles identify processes or elements that are generally required, whereas dashed rectangles identify processes or elements that appear in some implementations.

The user selects a plurality of data fields from a data source 236, and the computing device receives (1708) that selection. The data source 236 may be an SQL database, a spreadsheet, an XML file, a desktop database, a flat file, a CSV file, or other organized data source. Some implementations support combined or blended data sources, with data from two or more distinct sources. The data fields may be raw fields from the data source (i.e., the data field exists in the data source), may be computed from one or more raw fields (e.g., computing a month, quarter, or year from a date field in the data source), or may be calculated metrics computed based on raw data fields, such as a running total or year over year percentage growth.

In some instances, the user has already specified one or more visuallayout properties, and the device 102 receives (1710) or stores (1710)the user specifications. For example, a user may have alreadyconstructed a data visualization using a set of data fields. The usermay now seek alternative ways to visualize the same set of data (e.g.,using an alternative type of data visualization, such as a bar graphinstead of a text table). As described in more detail below, someimplementations use the visual layout properties specified by the userto tailor the data visualization options that will be presented to theuser.

The data visualization identification module 224 then identifies (1712) a plurality of data visualizations that use a majority of the user-selected data fields. In some instances, each of the plurality of data visualizations uses (1714) each of the user-selected data fields. Because the user has identified specific data fields for inclusion in a data visualization, options that use all of those data fields are generally preferred. However, when the user selects a large number of data fields, the complexity of evaluating all of the data visualization options increases exponentially, and the importance of each individual data field diminishes. In fact, if the number of selected fields is too large (e.g., exceeding a predefined threshold), each of the plurality of data visualizations uses (1716) fewer than all of the user-selected data fields. As illustrated in more detail below with respect to FIGS. 19A-19D, the identification module generally identifies some data visualization options that use exactly the data fields selected by the user and some data visualization options that use slightly modified sets of data fields.

In some implementations, each of the data visualizations has (1718) aunique view type that specifies how it is rendered. The “view type” isalso referred to as a “chart type” or a “mark type” in somecircumstances. In some implementations, the view types of the datavisualizations are (1720) “text table,” “bar chart,” “scatter plot,”“line graph,” or “map.” Some implementations support additional viewtypes, and/or subdivide these view types further (e.g., bar charts maybe subdivided into stacked bar charts and unstacked bar charts). Asdescribed in more detail below, some implementations use the view typesin the ranking process because different view types may have differentranking criteria.

For each of the plurality of data visualizations, the ranking module 226computes (1722) a score based on a set of ranking criteria. The rankingmodule 226 uses the data values from the user-selected data fields inthe ranking process so that the ranking is specific to the data setactually used. In particular, there may be characteristics of a specificdata set that make certain data visualization options better (or worse)than would be expected based on general rules that use the data types ofthe selected data fields.

At least a first ranking criterion is (1724) based on values of one ormore of the user-selected data fields in the set of data. In someimplementations, the first ranking criterion scores (1726) eachrespective data visualization according to visual structure of values ofone or more of the user-selected data fields as rendered in therespective data visualization. For example, in some instances, thevisual structure includes (1728) clustering of data points. Specifictechniques for measuring clustering in a scatter plot are describedbelow, but generally identify circumstances in which groups of pointsare relatively close to each other but distant from other groups.

In some instances, the visual structure includes (1730) the presence of outliers. Some specific techniques for identifying outliers are described below. In some instances, the visual structure includes (1732) monotonicity of rendered data points. Monotonicity may appear in various view types, including scatter plots, line graphs, and bar charts. To be monotone, the rendered data points must be strictly increasing, strictly decreasing, non-decreasing, or non-increasing (corresponding to the inequality operators >, <, ≥, and ≤, respectively). Of course the data points may not be perfectly monotone, so implementations typically measure the degree of monotonicity (e.g., the data points are strictly increasing except for one outlier).

In some instances, the visual structure includes (1734) striation of auser-selected data field. A set of data points is identified as striatedwhen a high percentage of the respective values of a data field are(1734) substantially an integer multiple of a single base value. Forexample, a data field whose values are 1.02, 1.01, 2.99, 3.03, 2.00,1.98 is striated because each of the values is approximately an integermultiple of 1. Of course the striated values do not have to be integers.For example, if the values of a data field are −2.24, −0.75, 0.51, 4.76,and 6.03, they are striated because each of these values isapproximately an integer multiple of 0.25.
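The striation test can be sketched as the fraction of values that lie close to an integer multiple of a candidate base value. The function name, the absolute tolerance of 0.05, and the fact that the base value is supplied by the caller rather than discovered automatically are all assumptions for illustration.

```python
def striation_measure(values, base, tolerance=0.05):
    """Fraction of values within `tolerance` of an integer multiple of `base`."""
    if not values or base <= 0:
        return 0.0
    hits = 0
    for v in values:
        # Snap the value to the nearest integer multiple of the base value.
        nearest_multiple = round(v / base) * base
        if abs(v - nearest_multiple) <= tolerance:
            hits += 1
    return hits / len(values)
```

Both examples in the preceding paragraph score 1.0 under this sketch: the first with a base value of 1 and the second with a base value of 0.25. A field would be treated as striated when the returned fraction exceeds some chosen "high percentage" cutoff.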

In some implementations, the first ranking criterion scores (1736) eachrespective data visualization according to one or more aestheticqualities of the respective data visualization as rendered using valuesof one or more of the user-selected data fields. In some cases, theaesthetic qualities measure how well the data visualization conveys thedata to the user (e.g., ease of understanding the data, ease ofretaining the information, etc.). In some instances, the aestheticqualities include (1738) the aspect ratio of the rendered datavisualizations. This is described in more detail below.

In some implementations, the aesthetic qualities include (1740) measuring the extent to which entire rendered data visualizations can be displayed on a user screen at one time in a human readable format. When a data visualization is too large to fit on the screen, a user misses out on the holistic view, which makes it impossible to compare some portions of the display and makes it difficult to find all of the potentially interesting regions. In some cases the data visualization can be scaled to a smaller size so that it fits on the screen, but scaling is limited. A scaled graphic that is a blur is not particularly useful because the user would have to zoom in and zoom out in order to see the details. Displaying a data visualization in a human readable format means that a user can visualize and use the data without the use of a zoom feature in the user interface. (Even when zooming is not required, a person may still use a zoom feature to see the detail better.)

In some implementations, the first ranking criterion scores (1742) each respective data visualization according to visual encodings of one or more of the user-selected data fields. As described above with respect to FIG. 14, implementations support various visual encodings, including (1744) assigning a size, shape, or color to visual marks according to values of a user-selected data field. The visual encodings may also include path encoding, which can be used to sort the rows or columns in a data visualization. The evaluation criteria identify how effectively the encodings communicate the data. Based on the range or distribution of values of a data field, certain encodings may be preferred or precluded. For example, if the range of values of a quantitative field includes negative values, size encoding is generally precluded. On the other hand, with a highly skewed distribution of quantitative values, a certain color palette may better convey the different values.

In some implementations, the first ranking criterion scores (1746) each respective data visualization according to the view type of the respective data visualization and the user-selected data fields. Different view types are better suited for display of different types of data, so the ranking process can evaluate each data visualization based on how well the view type conveys the data from the user-selected fields. For example, with two independent quantitative fields, a scatter plot is typically an appropriate data visualization. However, based on the specific data values for the data fields, a scatter plot may not be as effective as another view type.
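The rule about negative values can be expressed as a small data-dependent check, which is the situation shown for Montana in FIGS. 10A and 10B. The function name and the choice to always allow color are illustrative assumptions.

```python
def allowed_encodings(values):
    """Return the encodings that can represent every value of a quantitative field."""
    encodings = {"color"}  # color can represent any range, including negatives
    # A mark cannot have a negative size, so size requires all values >= 0.
    if values and min(values) >= 0:
        encodings.add("size")
    return encodings
```

A field such as population would permit both encodings, whereas a field with any negative value would be limited to color.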

In some implementations, the set of ranking criteria is (1748)hierarchical, comprising a first set of criteria that ranks view typesbased on the user-selected data fields, and a respective view-specificset of criteria that ranks individual data visualizations for therespective view type based on the user-selected data fields. Theseimplementations take advantage of the fact that comparing (i.e.,ranking) multiple data visualizations of the same view type usesdifferent criteria from comparing data visualizations with differentview types. In some implementations, the criteria for ranking datavisualizations within a single view type use the field values for one ormore of the data fields, whereas the criteria that compare acrossdifferent view types are based on general rules about the data types ofthe user-selected data fields. Other implementations use the fieldvalues to evaluate across view types. Implementations typically computea composite score for each data visualization based on many differentcriteria, with each ranking criterion assigned an appropriate weight.Some implementations adjust the weights of the ranking criteria overtime based on which data visualizations are actually selected by theuser.
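A composite score of this kind can be sketched as a weighted sum of partial scores. The partial-score names used in the usage note are taken from the ranking log description; the weights, the values, and the plain weighted-sum form are assumptions for illustration.

```python
def composite_score(partial_scores, weights):
    """Weighted sum of partial scores; criteria without a weight contribute nothing."""
    return sum(
        weights.get(name, 0.0) * score for name, score in partial_scores.items()
    )
```

For example, `composite_score({"VisualChunking": 0.9, "ScrollPenalty": -1.0}, {"VisualChunking": 2.0, "ScrollPenalty": 0.5})` combines a chunking reward with a scroll penalty. Adjusting the weights over time, as described above, changes only the `weights` mapping and leaves the partial scores untouched.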

In some implementations, the set of ranking criteria includes (1750) a second ranking criterion that measures the extent to which a data visualization option is consistent with the user-specified visual layout properties. As noted above, the user may specify some visual layout properties before the identification module 224 or ranking module 226 even begins. Some of the visual layout properties are described above with respect to FIGS. 14 and 15. See the visual specification 1406 in FIG. 14 and visual specification 1524 in FIG. 15. When the user has specified certain visual layout properties, data visualizations that adhere to the user selections are ranked higher than other data visualization options that deviate from the user selections.

Typically, the ranking module 226 creates (1752) a ranked list of the data visualization options, where the ranked list is ordered according to the computed scores of the data visualizations. The ranked list is then presented (1754) to the user. If the user selects (1756) one of the options from the ranked list, the data visualization application 222 displays (1758) the corresponding data visualization on the computing device 102.

As illustrated in FIG. 15, some implementations store information about the ranked data visualizations, including what data fields were selected by the user, the visual specification 1524 for each of the data visualization options, as well as other intermediate data that was used to calculate each of the rankings.

FIGS. 18A-18D provide a flowchart of a process 1800, performed (1804) at a computing device 102, for generating and ranking data visualizations (1802) in accordance with some implementations. The computing device 102 has (1804) one or more processors and memory, and the memory stores (1806) one or more programs for execution by the one or more processors. In this flowchart, solid rectangles identify processes or elements that are generally required, whereas dashed rectangles identify processes or elements that appear in some implementations.

The user selects a plurality of data fields from a data source 236, and the computing device receives (1808) that selection. The data source 236 may be a SQL database, a spreadsheet, an XML file, a desktop database, a flat file, a CSV file, or other organized data source. Some implementations support combined or blended data sources, with data from two or more distinct sources. The data fields may be raw fields from the data source (i.e., the data field exists in the data source) or may be computed from one or more raw fields (e.g., computing a month, quarter, or year from a date field in the data source). In some implementations, the plurality of user-selected fields includes (1810) a plurality of categorical data fields. A “categorical” data field is a data field with a limited number of distinct values, which categorize the data. For example, a “gender” data field is a categorical data field that may be limited to the two values “Female” and “Male” or “F” and “M”. The set of user-selected data fields typically includes one or more quantitative fields as well.

In some instances, the user selects (1812) a filter that applies to a first user-selected field, which is received (1812) by the data visualization application 222 or 320. A filter identifies (1814) a set of values for the first user-selected data field, and the data visualizations are based on limiting values of the first user-selected data field to the set of values. For example, a quantitative field with range 0-1000 could be filtered (i.e., limited) to the range 100-200. In this case, the set of values is (1818) an interval of numeric values. As another example, a categorical data field whose values are “N,” “S,” “E,” and “W” could be filtered to include only rows with field value “N” or “S.” In this case, the set of values is (1816) a finite set of discrete values.
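By way of a non-limiting illustration, the two kinds of filter described above (an interval of numeric values and a finite set of discrete values) can be sketched in Python as follows. The class names `IntervalFilter` and `SetFilter`, the helper `apply_filters`, the field names, and the sample rows are assumptions introduced only for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, List


@dataclass(frozen=True)
class IntervalFilter:
    """Limits a quantitative field to a closed numeric interval."""
    field: str
    low: float
    high: float

    def accepts(self, row: Dict[str, Any]) -> bool:
        return self.low <= row[self.field] <= self.high


@dataclass(frozen=True)
class SetFilter:
    """Limits a categorical field to a finite set of discrete values."""
    field: str
    allowed: FrozenSet[Any]

    def accepts(self, row: Dict[str, Any]) -> bool:
        return row[self.field] in self.allowed


def apply_filters(rows: List[Dict[str, Any]], filters: list) -> List[Dict[str, Any]]:
    """Keep only the rows accepted by every filter."""
    return [row for row in rows if all(f.accepts(row) for f in filters)]


# Hypothetical sample data.
rows = [
    {"direction": "N", "amount": 150},
    {"direction": "S", "amount": 950},
    {"direction": "E", "amount": 120},
    {"direction": "W", "amount": 180},
]
kept = apply_filters(
    rows,
    [IntervalFilter("amount", 100, 200), SetFilter("direction", frozenset({"N", "S"}))],
)
```

In this sketch only the first row survives, because it is the only row whose amount lies in the interval 100-200 and whose direction is “N” or “S.”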

In some instances, the user specifies (1820) a single view type, which is received (1820) by the data visualization application 222 or 320. In this case, the data visualization identification module 224 will limit the considered data visualizations to the single specified view type.

After the user specifies the set of data fields, the data visualization identification module 224 generates (identifies) (1822) a plurality of data visualization options. Each data visualization option associates (1824) each of the user-selected data fields with a respective predefined visual specification feature. Exemplary visual specification features are described above with respect to FIG. 14 (visual specification 1406) and FIG. 15 (visual specification 1524). When the user has selected a single view type, the data visualization options are generated (1826) according to the user-specified single view type. For example, if the user specifies “line graph” as the view type, then all of the generated data visualization options are line graphs.

In some implementations, the data visualization identification module 224 finds (1828) a first set of one or more data visualization options previously presented to the user and not selected by the user. In some of these implementations, the data visualization identification module 224 excludes (1830) the first set of data visualization options from the generated data visualization options. That is, if they were previously presented and not selected, the user may not want to see the same options again. In other implementations, previously presented data visualizations that were not selected are downgraded, but may still be presented to the user if they are identified as sufficiently good. In this case, some implementations continue to downgrade an option further when an option is presented and not selected a subsequent time.

In some instances, the data visualization identification module 224 identifies (1832) a first user-selected quantitative field in which some of the field values are negative. Such a quantitative field is generally not suitable for size encoding (unless an appropriate transformation is applied). Therefore, implementations typically limit (1834) the generation to data visualization options that do not encode the size of generated marks according to the first user-selected field.
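As a non-limiting illustration of steps 1832 and 1834, the exclusion of size encoding for a field with negative values might be sketched as follows. The function name `allowed_mark_encodings`, the candidate encoding names, and the sample values are assumptions made for this example.

```python
from typing import Iterable, Sequence, Tuple


def allowed_mark_encodings(
    values: Iterable[float],
    candidates: Sequence[str] = ("size", "color"),
) -> Tuple[str, ...]:
    """Return the candidate encodings, dropping "size" if any value is negative."""
    if any(v < 0 for v in values):
        return tuple(e for e in candidates if e != "size")
    return tuple(candidates)


# Hypothetical field values: profit includes a negative value, sales does not.
profit = [120.0, -35.5, 48.0]
sales = [120.0, 35.5, 48.0]
profit_encodings = allowed_mark_encodings(profit)
sales_encodings = allowed_mark_encodings(sales)
```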

In some instances, the data visualization identification module 224 identifies (1836) a first user-selected field that has a specific distribution of data values (e.g., uniformly distributed, skewed, bimodal, etc.), and selects (1838) a color palette for encoding the values of that data field based on the specific distribution of values for that data field. For example, a simple color gradient may be effective for a uniform distribution of data values, but might not be effective to illustrate other distributions. For a skewed or bimodal distribution of values, using visually distinct colors for different value ranges, or stepped color ranges, may be more effective to convey the value distribution. Once a specific color palette has been selected based on the specific distribution of values, implementations typically limit (1840) the generation to data visualization options that use the selected color palette for encoding the first user-selected data field.

In some instances, the data visualization identification module 224 identifies (1842) three or more distinct quantitative user-selected data fields. In some data visualizations, these quantitative fields are placed adjacent to each other, as illustrated in FIGS. 16A and 16B above. As explained with respect to FIGS. 16A and 16B, some implementations identify (1844) an ordering of the three or more distinct data fields that maximizes the total pairwise correlation between adjacent data fields. When this occurs, implementations limit (1846) the generation to data visualization options that use the identified ordering of the three or more data fields.
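By way of a non-limiting illustration, one way to find such an ordering is to enumerate the permutations of the fields and keep the one with the largest summed correlation between neighbors. The sketch below assumes that the absolute value of the Pearson correlation is summed (the description above does not say whether sign matters) and uses exhaustive search, which is practical only for a small number of fields. The function names and sample columns are hypothetical.

```python
from itertools import permutations
from math import sqrt
from typing import Dict, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_ordering(columns: Dict[str, Sequence[float]]) -> Tuple[str, ...]:
    """Ordering of field names maximizing summed |correlation| of adjacent fields."""
    def adjacent_total(order: Tuple[str, ...]) -> float:
        return sum(
            abs(pearson(columns[a], columns[b])) for a, b in zip(order, order[1:])
        )
    return max(permutations(columns), key=adjacent_total)


# Hypothetical quantitative fields; "b" and "c" are the weakest-correlated pair.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
}
order = best_ordering(columns)
```

With three fields, the best ordering is the one that keeps the most weakly correlated pair apart, so field "a" lands in the middle here.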

In some implementations, the data visualization identification module 224 identifies (1848) a distribution of values for a first quantitative user-selected data field for which a logarithmic scale results in a substantially linear arrangement of marks. For example, in a scatter plot with two quantitative fields, one of the fields may be approximately a polynomial function of the other data field. In this case, using a logarithmic scale on both axes would result in a set of points that is substantially linear (e.g., not more than 5% variation from a line). When this occurs, implementations typically limit (1850) the generation to data visualization options that use a logarithmic scale for the first quantitative user-selected data field.
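As a non-limiting illustration, the test for a "substantially linear" arrangement on logarithmic scales might be sketched as a least-squares line fit in log-log space. The interpretation of the 5% figure as the largest residual relative to the spread of the log-transformed values is an assumption of this sketch, as are the function name and the sample data. All values must be positive for the logarithm to be defined.

```python
from math import log10
from typing import Sequence


def is_linear_on_log_scales(
    xs: Sequence[float], ys: Sequence[float], tolerance: float = 0.05
) -> bool:
    """True when (log x, log y) deviates from its best-fit line by at most
    `tolerance` times the spread of log y. Requires positive values and at
    least two distinct x values."""
    lx = [log10(x) for x in xs]
    ly = [log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly)) / sxx
    intercept = mean_y - slope * mean_x
    worst = max(abs(v - (slope * u + intercept)) for u, v in zip(lx, ly))
    return worst <= tolerance * (max(ly) - min(ly))


xs = [1, 2, 4, 8, 16]
power_law = [3 * x ** 2 for x in xs]   # exactly linear in log-log space
erratic = [1, 100, 2, 50, 3]           # no such relationship
```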

Some implementations evaluate data visualizations based on “visual chunking.” This was illustrated above with respect to FIGS. 8A and 8B. In FIG. 8A, with Loan Sector 804 as the innermost field for the rows, the chunks are fairly large, as indicated by the grouping 806. However, by switching to Loan Status 820 as the innermost field in FIG. 8B, each of the chunks has four or five elements, as illustrated by the groupings 822, 824, 826, and 828. FIG. 8B illustrates better visual chunking, and is thus preferred.

Some implementations identify data visualizations with better visual chunking by determining (1852) a hierarchical order of the first plurality of categorical data fields based on measuring the visual chunking of the innermost categorical data field in the hierarchical order. In particular, visual chunking of the innermost categorical data field is measured (1854) by comparing the number of distinct values of the innermost data field to a predefined target number. In some implementations, the target number is 5. When a specific hierarchical order of the categorical fields has been identified, implementations typically limit (1858) the generation to data visualization options that use the determined hierarchical order of the first plurality of data fields.
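By way of a non-limiting illustration, choosing the innermost categorical field by comparing cardinalities to the target number can be sketched as follows. The function names and the cardinalities assigned to the two fields are assumptions for this example (the figures do not state exact counts), and the order of the outer fields is simply left as given.

```python
from typing import Dict, List


def choose_innermost(cardinalities: Dict[str, int], target: int = 5) -> str:
    """Pick the field whose number of distinct values is closest to the target."""
    return min(cardinalities, key=lambda name: abs(cardinalities[name] - target))


def hierarchical_order(cardinalities: Dict[str, int], target: int = 5) -> List[str]:
    """Outer fields in their given order, followed by the chosen innermost field."""
    inner = choose_innermost(cardinalities, target)
    return [name for name in cardinalities if name != inner] + [inner]


# Hypothetical distinct-value counts for two categorical fields.
order = hierarchical_order({"Loan Sector": 15, "Loan Status": 4})
```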

After the set of data visualizations has been identified, the ranking module 226 computes (1860) a score for each of the generated data visualization options based on a set of ranking criteria. In some implementations, the computation of scores for one or more of the data visualizations uses (1862) historical data of data visualizations previously created for the set of data. For example, the ranking module may use data from a history log 232 and/or ranking log 234. The historical data may include visualizations created for other users that use the same or similar data fields. For example, a new person in a finance department for a company can take advantage of prior work by other individuals in the department because the data visualization application 222 or 320 has stored their prior selections in the history log 232 and/or ranking log 234. In particular, the logs store the visual specifications 1406 and 1524, and thus future ranking (or generation) processes can upgrade the visual layout features from the visual specifications 1406 or 1524 that were previously selected by users.

In some implementations, the computation of scores for one or more of the data visualizations uses (1864) historical data of data visualizations previously selected by the user. This can include historical data for data visualizations based on different data sets or different data fields. For example, a specific user may have preferences for certain types of data visualizations (e.g., specific view types) or certain types of encodings (e.g., a preference for color encoding versus size encoding), and these preferences (as indicated by past selections) may apply across varying data sets.

In some implementations, the computation of scores for one or more of the data visualizations uses (1866) a set of user preferences for the user. As noted above, prior user selections may establish a user's preferences. In addition, some implementations allow a user to specify preferences explicitly. An explicit user preference is particularly relevant when the user's history is consistent with those preferences.

At least one of the ranking criteria is (1868) based on values of one or more of the user-selected data fields in the set of data. This was described in more detail above with respect to FIGS. 17A-17C.

The data visualization application 222 or 320 then creates (1870) a ranked list of the data visualization options, where the ranked list is ordered according to the computed scores of the data visualization options. Typically, the ranked list is presented (1872) to the user, the user selects (1872) from the ranked list, and a data visualization corresponding to the user selection is displayed (1876) on the user's computing device 102.

FIGS. 19A-19D provide a flowchart of a process 1900, performed (1904) at a computing device 102, for ranking data visualizations (1902) in accordance with some implementations. The computing device 102 has (1904) one or more processors and memory, and the memory stores (1906) one or more programs for execution by the one or more processors. In this flowchart, solid rectangles identify processes or elements that are generally required, whereas dashed rectangles identify processes or elements that appear in some implementations.

The data visualization application 222 or 320 receives (1908) user selection of a set of data fields from a set of data, and identifies (1910) a plurality of data visualizations that use each data field in the user-selected set of data fields. This has been described in some detail with respect to FIGS. 17A-17C and 18A-18D.

In addition to the data visualizations based on exactly the set of data fields selected by the user, some implementations identify (1912) a plurality of alternative data visualizations as well. Each respective alternative data visualization uses (1914) each data field in a respective modified set of data fields. The modified sets of data fields do not differ too much from the original set of data fields selected by the user because the goal is to identify data visualization options that are responsive to the user's request. In particular, each respective modified set differs (1914) from the user-selected set by a limited sequence of atomic operations. In some implementations, the sequence of atomic operations is limited (1916) to two atomic operations.

In some implementations, each of the atomic operations is (1918) one of:

-   removing (1920) a single data field from the user-selected set;
-   adding (1922) a single data field to the user-selected set;
-   replacing (1924) a user-selected field with a hierarchically narrower data field from the set of data;
-   replacing (1926) a user-selected field with a hierarchically broader data field from the set of data;
-   adding (1928) a filter to a data field that limits values retrieved to a specified subset of values;
-   removing (1930) a user-selected filter from a data field so that there is no limit on values retrieved for the data field; or
-   modifying (1932) a filter for a data field, thereby altering values retrieved for the data field.

These atomic operations were described in more detail above with respect to FIGS. 6A and 6B.
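By way of a non-limiting illustration, the enumeration of modified sets reachable by a limited sequence of atomic operations can be sketched as a breadth-first expansion. For brevity the sketch below models only two of the listed operations (removing a field and adding a field); the replace and filter operations are omitted. The function names, the rule that a modified set is never empty, and the sample field names are assumptions for this example.

```python
from typing import Dict, FrozenSet, Iterable, Set


def one_step(fields: FrozenSet[str], available: FrozenSet[str]) -> Set[FrozenSet[str]]:
    """All sets reachable by one add or one remove (never producing an empty set)."""
    results: Set[FrozenSet[str]] = set()
    if len(fields) > 1:
        for name in fields:
            results.add(fields - {name})      # remove a single data field
    for name in available - fields:
        results.add(fields | {name})          # add a single data field
    return results


def modified_sets(
    selected: Iterable[str], available: Iterable[str], max_ops: int = 2
) -> Dict[FrozenSet[str], int]:
    """Map each modified set to the fewest atomic operations needed to reach it."""
    start, pool = frozenset(selected), frozenset(available)
    distance = {start: 0}
    frontier = [start]
    for step in range(1, max_ops + 1):
        next_frontier = []
        for current in frontier:
            for candidate in one_step(current, pool):
                if candidate not in distance:
                    distance[candidate] = step
                    next_frontier.append(candidate)
        frontier = next_frontier
    del distance[start]                       # the unmodified set is not an alternative
    return distance


alternatives = modified_sets(
    {"Sales", "Region"}, {"Sales", "Region", "Year", "Profit"}
)
```

The recorded operation count for each modified set is the kind of quantity a ranking criterion could use when deciding how much to downgrade an alternative.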

In some instances, at least one of the alternative data visualizations is (1934) based on a modified set of data fields that differs from the user-selected set of data fields by including an additional data field from the set of data. Adding an additional data field is more common when the user-selected set of data fields is small. For the modified set, the same generation and ranking techniques described above in FIGS. 17A-17C and 18A-18D apply.

In some instances, at least one of the alternative data visualizations is (1936) based on a modified set of data fields that differs from the user-selected set of data fields by removing a user-selected data field. Removing a data field is more common when the user specifies a large set of data fields. In some implementations, when the set of user-selected data fields is too large, only subsets are considered in the generation process. For the modified set, the same generation and ranking techniques described above in FIGS. 17A-17C and 18A-18D apply.

In some instances, at least one of the alternative data visualizations is (1938) based on a modified set of data fields that differs from the user-selected set of data fields by replacing a user-selected data field with a different data field that is hierarchically narrower than the user-selected data field. When using date fields, a user may have specified using year, whereas providing data by quarter or month may be more useful. As another example, the user may have requested data for product lines, and it may be useful to break down each product line into individual products. For the modified set, the same generation and ranking techniques described above in FIGS. 17A-17C and 18A-18D apply.

In some instances, at least one of the alternative data visualizations is (1940) based on a modified set of data fields that differs from the user-selected set of data fields by replacing a user-selected data field with a different data field that is hierarchically broader than the user-selected data field. In this case, having detail at too narrow a level may present too much “noise,” which may obscure other important information. Therefore, replacing a narrow field with a broader field may provide more information. For the modified set, the same generation and ranking techniques described above in FIGS. 17A-17C and 18A-18D apply.

In some cases, filters are applied to one or more data fields to limit the rows retrieved from the data source 236. In some instances, the modified set of data fields includes modifying the set of filters. In some instances, at least one of the alternative data visualizations is (1942) based on a modified set of data fields that differs from the user-selected set of data fields by applying a filter to a user-selected data field, thereby limiting values of the user-selected data field to a first set of values, wherein the filter is not selected by the user. In some instances, at least one of the alternative data visualizations is (1944) based on a modified set of data fields that differs from the user-selected set of data fields by removing a user-selected filter for a user-selected data field. In some instances, at least one of the alternative data visualizations is (1946) based on a modified set of data fields that differs from the user-selected set of data fields by modifying a user-selected filter for a data field, thereby altering values retrieved for the data field. In each of these instances, for the modified set, the same generation and ranking techniques described above in FIGS. 17A-17C and 18A-18D apply.

The ranking module 226 computes (1948) a score for each of the data visualizations and each of the alternative data visualizations based on a set of ranking criteria. Implementations typically include a ranking criterion that downgrades data visualization options based on modified sets, with the amount of downgrade related to the number of atomic operations needed to build the corresponding modified set. (Alternatively, some implementations upgrade the data visualizations that use an unmodified set.) The amount of downgrade also depends on the number of user-selected data fields and the specific operation. For example, if the user-selected set of fields is small, then an atomic operation to remove one of those user-specified data fields would be heavily downgraded, whereas an operation to add another field may have only a slight downgrade. In some instances, if the number of user-selected fields is very small, adding additional fields may not be downgraded at all, especially if the data field added is semantically related to one or more of the user-selected data fields. On the other hand, if the number of user-selected fields is large, the downgrade would be small for removing one of the user-selected fields, but the downgrade would be substantial for adding another data field. When removing a data field, there is a preference for removing a field that is not semantically related to the other user-selected data fields.

For each set of data fields (the original set or a modified set), there is (1950) at least one ranking criterion that uses values of one or more fields in the set. Because the sets of data fields are different, the criteria that use data field values can be different.

After all of the data visualizations and alternative data visualizations are scored and ranked, the data visualization application 222 or 320 presents (1952) data visualization options to the user. The presented options correspond (1952) to high scoring data visualizations and high scoring alternative data visualizations. In general, only a small subset of the options is presented. In some implementations, the user interface includes a button or other object to see more options.

In some implementations, the data visualization options are presented (1954) to the user in a single ranked list that is ordered according to the computed scores of the data visualizations and the computed scores of the alternative data visualizations. In this case, all of the options are presented together, regardless of whether they are based on the original list of data fields selected by the user or a modified list of data fields. In some implementations, when all of the data visualization options are presented together, there is a visual indicator on the list so that the user knows whether each option is based on the original set of data fields or a modified set of data fields.

In some implementations, the data visualization options are presented (1956) to the user in two ranked lists. The first ranked list includes (1956) high scoring data visualizations, ordered according to corresponding computed scores. The second ranked list includes (1956) high scoring alternative data visualizations, ordered according to corresponding computed scores.

Typically, the user selects (1958) one of the presented data visualization options, and the data visualization application displays the corresponding data visualization on the computing device 102.

In some implementations, the generated list of options remains available to the user (e.g., through a menu or toolbar icon). In that way, if the user selects a first data visualization option and wants to evaluate another option, the user can go directly to the list rather than going through another generating/ranking process. In some implementations, the ranking log 234 includes all of the information needed to build each of the ranked data visualizations, and thus the list of ranked data visualizations can be redisplayed quickly without a generation or ranking process. In some implementations, a user can select an older ranked list (e.g., go back to a ranked list from last week).

Some implementations use available resources to pre-create ranked lists of data visualization options based on data fields a user is currently using (e.g., if the set of data fields in use has not been modified for a predefined amount of time, generate a set of data visualization options based on that set of data fields). This can be useful to provide a rapid response if a user does ask for data visualization options. In some implementations, pre-creating data visualization options uses more complex generation or ranking algorithms because there is no requirement to respond quickly.

In some implementations, the scoring calculation for each identified data visualization has three components: a DataScore S_(D), which is based on how well the data visualization displays statistical properties of the data fields; a LayoutScore S_(L), which is based on the aesthetic qualities of the data visualization; and a SimilarityScore S_(S), which is based on how closely the data visualization aligns with user selections. The SimilarityScore does not depend on the view type, but the DataScore and LayoutScore do depend on the view type. The total score T is then computed based on one or more of these three scores. In some implementations, the total score is T=w_(D)S_(D)+w_(L)S_(L)+w_(S)S_(S), where the values w_(D), w_(L), and w_(S) are the weights for each of the three partial scores. Typically w_(D)>w_(L)>w_(S).
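By way of a non-limiting illustration, the weighted total score and the resulting ranked list can be sketched as follows. The specific weight values (0.5, 0.3, 0.2) and the partial scores of the three options are assumptions chosen for the example; the description above states only that typically w_(D)>w_(L)>w_(S).

```python
from typing import Dict, List, Tuple

Partial = Tuple[float, float, float]  # (DataScore, LayoutScore, SimilarityScore)


def total_score(partial: Partial, weights: Partial = (0.5, 0.3, 0.2)) -> float:
    """T = w_D * S_D + w_L * S_L + w_S * S_S."""
    w_d, w_l, w_s = weights
    s_d, s_l, s_s = partial
    return w_d * s_d + w_l * s_l + w_s * s_s


def rank_options(options: Dict[str, Partial]) -> List[str]:
    """Option names ordered from highest to lowest total score."""
    return sorted(options, key=lambda name: total_score(options[name]), reverse=True)


# Hypothetical partial scores for three data visualization options.
options = {
    "A": (0.9, 0.5, 0.2),   # total 0.64
    "B": (0.4, 0.9, 1.0),   # total 0.67
    "C": (0.2, 0.2, 0.2),   # total 0.20
}
ranked = rank_options(options)
```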

The weights are determined empirically based on actual selection by users. For example, in some implementations, a history log 232 stores details about the data visualization options that are presented to the user, including the partial scores that were used in the ranking. The log also stores which data visualizations the user selects. Using this data, weights can be selected to produce rankings that align as closely as possible with the user selections. For example, some implementations use an iterative process that adjusts the weights by small amounts in each step. Some implementations define a function F that is a function of the three weights, where F measures the differences between the computed rankings and the ranking as identified by the user. In each iteration, the process estimates the partial derivatives with respect to the weights, and adjusts the weights accordingly to optimize the function F (i.e., find weights where F is a minimum).
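As a non-limiting illustration of such an iterative process, the sketch below estimates the partial derivatives of F by central finite differences and takes a small step against the gradient in each iteration. The particular form of F (a margin-based penalty that is positive whenever a non-selected option scores within a margin of the selected option), the step size, the margin, and the logged partial scores are all assumptions of this sketch; the description above does not specify them.

```python
from typing import List, Sequence, Tuple

Partial = Tuple[float, float, float]          # (S_D, S_L, S_S)
LogEntry = Tuple[Partial, List[Partial]]      # (selected option, options not selected)


def total(weights: Sequence[float], partial: Partial) -> float:
    return sum(w * s for w, s in zip(weights, partial))


def ranking_loss(weights: Sequence[float], log: List[LogEntry], margin: float = 0.1) -> float:
    """F: penalty for each non-selected option scoring within `margin` of the selection."""
    loss = 0.0
    for selected, others in log:
        for other in others:
            loss += max(0.0, margin + total(weights, other) - total(weights, selected))
    return loss


def fit_weights(
    weights: Sequence[float],
    log: List[LogEntry],
    step: float = 0.1,
    iterations: int = 50,
    eps: float = 1e-6,
) -> List[float]:
    """Adjust the weights by small amounts using estimated partial derivatives of F."""
    current = list(weights)
    for _ in range(iterations):
        gradient = []
        for i in range(len(current)):
            up, down = current.copy(), current.copy()
            up[i] += eps
            down[i] -= eps
            gradient.append((ranking_loss(up, log) - ranking_loss(down, log)) / (2 * eps))
        current = [w - step * g for w, g in zip(current, gradient)]
    return current


# Hypothetical history: the user picked the option with the high DataScore.
log = [((0.9, 0.3, 0.2), [(0.3, 0.6, 0.8)])]
initial = [1 / 3, 1 / 3, 1 / 3]
fitted = fit_weights(initial, log)
```

In this small example the fitted weights rank the user's selection first, and the data weight ends up largest.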

In some implementations, the SimilarityScore S_(S) is just the number of matched data fields divided by the total number of selected data fields. A matched data field is one where the usage of the data field in the identified data visualization is the same as the usage already selected by the user. For example, if the user has specified field F1 for color encoding, then there is a match when an identified data visualization uses the field F1 for color encoding. A “perfect” score of 1.0 occurs when the user has specified the usage (e.g., encoding) for all of the selected data fields, and the identified data visualization uses all of the fields in that same way. Note that the SimilarityScore S_(S) does not incorporate the view type of the data visualization, and it is possible to have multiple view types use the selected data fields in the same way. For example, a user may have constructed a bar graph to visualize certain data, but later wonders if there are alternative better ways to visualize the data. Other view types that preserve the user's selections are preferred, and the preference is accomplished by the SimilarityScore S_(S).
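By way of a non-limiting illustration, the SimilarityScore can be sketched as a ratio of matched fields to selected fields. Representing each usage as a string, and representing a field with no user-specified usage as `None`, are assumptions of this sketch, as are the field and usage names.

```python
from typing import Dict, Optional


def similarity_score(
    user_usage: Dict[str, Optional[str]], option_usage: Dict[str, str]
) -> float:
    """Matched data fields divided by the total number of selected data fields.

    `user_usage` maps every selected field to the usage the user specified,
    or None when the user specified no usage for that field.
    """
    if not user_usage:
        return 0.0
    matched = sum(
        1
        for name, usage in user_usage.items()
        if usage is not None and option_usage.get(name) == usage
    )
    return matched / len(user_usage)


# Hypothetical selections: F1 and F2 have a specified usage, F3 does not.
user = {"F1": "color", "F2": "rows", "F3": None}
option = {"F1": "color", "F2": "columns", "F3": "size"}
score = similarity_score(user, option)  # only F1 matches
```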

As noted above, the DataScore and LayoutScore depend on the view type. In some implementations, the scores are computed as illustrated below.

Text Tables

In some implementations, the ordering of categorical data fields is evaluated to favor placing a category with cardinality close to five as the innermost level of the chart. This leverages the fact that people are better able to retain and compare chunks of five (±2) data elements. One way to quantify this criterion computes:

VisualChunking=1−abs(Cardinality(innermostDimension)−5)/5

In addition, some implementations prefer text tables that are densely filled, which avoids the distraction of sparsely populated cells. One way to quantify this criterion computes:

Sparsity=(number of empty cells)/(total number of cells in the table)

Some implementations combine these two criteria by subtracting, because effective text tables typically have low Sparsity. That is:

DataScore=VisualChunking−Sparsity
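By way of a non-limiting illustration, the text table DataScore can be sketched directly from the three formulas above. Representing the table as a list of rows in which an empty cell is `None` is an assumption of this sketch, as are the function names and sample values.

```python
from typing import List, Optional, Sequence


def visual_chunking(cardinality: int, target: int = 5) -> float:
    """VisualChunking = 1 - abs(Cardinality(innermostDimension) - 5) / 5."""
    return 1 - abs(cardinality - target) / target


def sparsity(cells: Sequence[Sequence[Optional[float]]]) -> float:
    """Sparsity = (number of empty cells) / (total number of cells in the table)."""
    flat = [cell for row in cells for cell in row]
    return sum(cell is None for cell in flat) / len(flat)


def text_table_data_score(
    innermost_cardinality: int, cells: Sequence[Sequence[Optional[float]]]
) -> float:
    """DataScore = VisualChunking - Sparsity."""
    return visual_chunking(innermost_cardinality) - sparsity(cells)


# Hypothetical table: innermost field has 4 distinct values, 2 of 8 cells are empty.
cells: List[List[Optional[float]]] = [[1, None, 3, 4], [5, 6, None, 8]]
score = text_table_data_score(4, cells)  # 0.8 - 0.25
```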

Aesthetically, some implementations prefer tables that display completely on the screen. One way to quantify this is whether there are scrollbars in the view. Some implementations differentiate between vertical scroll bars and horizontal scroll bars. In addition, some implementations prefer a table whose visible area has a vertical aspect ratio (i.e., height/width>1.0). In some implementations, the LayoutScore is computed as:

if (horizontal scroll bar and vertical scroll bar)
    ScrollPenalty = Value₁
else if (horizontal scroll bar only)
    ScrollPenalty = Value₂
else if (vertical scroll bar only)
    ScrollPenalty = Value₃
else
    ScrollPenalty = 0.00
end if
LayoutScore = AspectRatio − ScrollPenalty

Bar Charts

In some implementations, bar charts (also known as bar graphs) share some of the same criteria used by text tables. The ordering of categories is evaluated to favor placing a category with cardinality close to five as the innermost level of the chart. As with text tables, some implementations compute this as:

VisualChunking=1−abs(Cardinality(innermostDimension)−5)/5

In some implementations, the DataScore for a bar chart is based on just this criterion, so DataScore=VisualChunking.

Similar to text tables, bar charts that fit completely within the display score more highly. When scroll bars are necessary to display the data, scroll bars that are perpendicular to the bars in the chart are preferable (e.g., vertical scroll bars when the bars in the chart are horizontal). Even when there are no scroll bars, the preferred aspect ratio depends on the orientation of the bars in the chart. Specifically, a vertical aspect ratio is better with horizontal bars and a horizontal aspect ratio is better with vertical bars. In some implementations, the LayoutScore for a bar graph is computed as:

if (horizontal scroll bar and vertical scroll bar)
    ScrollPenalty = Value₁
else if (horizontal bars in chart and vertical scroll bar)
    ScrollPenalty = Value₂
else if (horizontal bars in chart and horizontal scroll bar)
    ScrollPenalty = Value₃
else if (vertical bars in chart and vertical scroll bar)
    ScrollPenalty = Value₄
else if (vertical bars in chart and horizontal scroll bar)
    ScrollPenalty = Value₅
else
    ScrollPenalty = 0.00
end if
if (vertical bars in chart)
    LayoutScore = ( 1 / AspectRatio ) − ScrollPenalty
else
    LayoutScore = AspectRatio − ScrollPenalty
end if
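By way of a non-limiting illustration, the bar chart LayoutScore computation above can be sketched in Python as follows. The numeric penalties stand in for Value₁ through Value₅, which the description leaves unspecified; they are assumptions chosen only so that scroll bars perpendicular to the bars are penalized less. AspectRatio is taken to be height/width, consistent with the text table discussion.

```python
from typing import Dict, Hashable

# Hypothetical stand-ins for Value1..Value5 (the description gives no numbers).
DEFAULT_PENALTIES: Dict[Hashable, float] = {
    "both": 0.5,                        # Value1: both scroll bars present
    ("horizontal", "vertical"): 0.1,    # Value2: horizontal bars, vertical scroll bar
    ("horizontal", "horizontal"): 0.3,  # Value3: horizontal bars, horizontal scroll bar
    ("vertical", "vertical"): 0.3,      # Value4: vertical bars, vertical scroll bar
    ("vertical", "horizontal"): 0.1,    # Value5: vertical bars, horizontal scroll bar
}


def bar_chart_layout_score(
    width: float,
    height: float,
    bars: str,                          # "horizontal" or "vertical"
    h_scroll: bool = False,
    v_scroll: bool = False,
    penalties: Dict[Hashable, float] = DEFAULT_PENALTIES,
) -> float:
    """LayoutScore for a bar chart, following the branching shown above."""
    aspect_ratio = height / width
    if h_scroll and v_scroll:
        scroll_penalty = penalties["both"]
    elif h_scroll or v_scroll:
        scroll = "horizontal" if h_scroll else "vertical"
        scroll_penalty = penalties[(bars, scroll)]
    else:
        scroll_penalty = 0.0
    if bars == "vertical":
        return (1 / aspect_ratio) - scroll_penalty
    return aspect_ratio - scroll_penalty


wide_vertical = bar_chart_layout_score(800, 400, "vertical")
tall_horizontal = bar_chart_layout_score(400, 800, "horizontal", v_scroll=True)
```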

In some implementations, the lengths of the bars in a bar chart are always scaled by the size of the display, so it would not be possible to have scroll bars in the same orientation as the bars in the chart.

Scatter Plots

A primary objective of a scatter plot is to identify interesting properties of the data based on visual patterns or shapes in the display. These patterns and shapes include clumps (clusters), monotonicity (positive or negative correlation), striation (presence of a discrete or integer variable), and outliers. Some implementations partition the underlying data into multiple panes and compute a score for each visible scatter plot chart. The scores for each pane are combined (e.g., by summing) for an overall score. In some implementations, a monotonicity score uses Pearson correlation computed over all of the points in the data set. In some implementations, scores for striation, clumpiness, and outliers are computed using a minimum spanning tree over the set of points in the data set. Some implementations use Prim's algorithm to construct the minimum spanning tree.

Some implementations use the following formula to compute Pearson's Correlation for a scatter plot:

$\text{PearsonsCorrelation} = r_{xy} = \frac{\sum_{i = 1}^{n} \left( x_{i} - \bar{x} \right)\left( y_{i} - \bar{y} \right)}{\left( n - 1 \right) s_{x} s_{y}}$

where x̄ is the mean of x, ȳ is the mean of y, s_(x) is the sample standard deviation of x, and s_(y) is the sample standard deviation of y.
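By way of a non-limiting illustration, the formula above can be computed with the Python standard library, whose `statistics.stdev` function returns the sample standard deviation (dividing by n − 1). The function name and sample data are assumptions for this example.

```python
from statistics import mean, stdev
from typing import Sequence


def pearsons_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """r_xy = sum((x_i - mean_x) * (y_i - mean_y)) / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return numerator / ((n - 1) * stdev(xs) * stdev(ys))


rising = pearsons_correlation([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly monotonic
falling = pearsons_correlation([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly anti-monotonic
```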

In some implementations, the measure of clumpiness uses the formula:

$\text{ClumpyMeasure} = \max\limits_{j}\left\lbrack 1 - \max\limits_{k}\left( \frac{\text{length}(k)}{\text{length}(j)} \right) \right\rbrack$

where j ranges over the set of edges in the constructed minimum spanning tree and k ranges over edges in each runt set derived from the edge j. For an edge j, the runt sets are formed by removing all edges from the minimum spanning tree that have a length at least as large as the length of edge j. The edge j has two endpoints, and each of the runt sets consists of the remaining edges that are connected to one of those endpoints. Because the larger edges are removed, length(k)<length(j) for each edge k in the runt sets.
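By way of a non-limiting illustration, the clumpiness measure can be sketched with a simple O(n²) version of Prim's algorithm over the complete Euclidean graph of the points. Two choices in this sketch are assumptions rather than part of the description above: an edge j whose runt sets contain no edges is skipped (otherwise the inner maximum would be taken over an empty set), and the measure is 0.0 when no edge has a non-empty runt set. The function names and sample points are hypothetical.

```python
from math import dist
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Edge = Tuple[int, int, float]  # (vertex index, vertex index, length)


def minimum_spanning_tree(points: Sequence[Point]) -> List[Edge]:
    """Prim's algorithm on the complete Euclidean graph of the points."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n   # shortest known connection of each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges: List[Edge] = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def runt_edge_lengths(edges: Sequence[Edge], start: int, limit: float) -> List[float]:
    """Lengths of tree edges shorter than `limit` that stay connected to `start`."""
    adjacency: Dict[int, List[Tuple[int, float]]] = {}
    for a, b, length in edges:
        if length < limit:
            adjacency.setdefault(a, []).append((b, length))
            adjacency.setdefault(b, []).append((a, length))
    seen, stack, lengths = {start}, [start], []
    while stack:
        u = stack.pop()
        for v, length in adjacency.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
                lengths.append(length)
    return lengths


def clumpy_measure(points: Sequence[Point]) -> float:
    """max over edges j of [1 - max over runt edges k of length(k) / length(j)]."""
    edges = minimum_spanning_tree(points)
    result = 0.0
    for a, b, length_j in edges:
        runt = runt_edge_lengths(edges, a, length_j) + runt_edge_lengths(edges, b, length_j)
        if runt:  # assumption: skip edges whose runt sets are empty
            result = max(result, 1 - max(runt) / length_j)
    return result


# Two tight clusters joined by one long tree edge of length 9.
clustered = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1)]
clumpiness = clumpy_measure(clustered)  # 1 - 1/9
```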

In some implementations, striation of a scatter plot is measured as:

$\text{StriationMeasure} = \frac{1}{\left\| T_{2} \right\|}\sum\limits_{v \in T_{2}}\left| \cos\left( \theta_{v} \right) \right|$

where T₂ is the set of all vertices of degree 2 in a minimum spanning tree T, ∥T₂∥ is the cardinality of T₂, and θ_(v) is the angle formed at the vertex v using the other two vertices connected to the vertex v. In particular, when a scatter plot is heavily striated, the minimum spanning tree typically includes many points that are collinear, and thus the angles θ_(v) are frequently 0 degrees or 180 degrees, in which case |cos (θ_(v))|=1.
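By way of a non-limiting illustration, the striation measure can be sketched as follows for a tree that has already been constructed. The tree is passed in as a list of vertex index pairs; returning 0.0 when the tree has no vertex of degree 2 is an assumption of this sketch, as are the function name and the sample points.

```python
from math import hypot
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def striation_measure(points: Sequence[Point], tree_edges: Sequence[Tuple[int, int]]) -> float:
    """Mean of |cos(theta_v)| over the degree-2 vertices v of the spanning tree."""
    neighbors: Dict[int, List[int]] = {}
    for a, b in tree_edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    degree_two = [v for v, adjacent in neighbors.items() if len(adjacent) == 2]
    if not degree_two:
        return 0.0  # assumption: no degree-2 vertices means no measurable striation
    total = 0.0
    for v in degree_two:
        p, q = (points[i] for i in neighbors[v])
        ax, ay = p[0] - points[v][0], p[1] - points[v][1]
        bx, by = q[0] - points[v][0], q[1] - points[v][1]
        cos_theta = (ax * bx + ay * by) / (hypot(ax, ay) * hypot(bx, by))
        total += abs(cos_theta)
    return total / len(degree_two)


# Collinear points give angles of 180 degrees; an L-shaped path gives 90 degrees.
collinear = striation_measure([(0, 0), (1, 0), (2, 0), (3, 0)], [(0, 1), (1, 2), (2, 3)])
right_angle = striation_measure([(0, 0), (1, 0), (1, 1)], [(0, 1), (1, 2)])
```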

Some implementations use a minimum spanning tree to calculate a measure of outliers in a scatter plot as well. Within a minimum spanning tree, let q₂₅ be the length of an edge in the minimum spanning tree at the 25th percentile and q₇₅ be the length of an edge in the minimum spanning tree at the 75th percentile. Then, let ω=q₇₅+1.5(q₇₅−q₂₅). In some implementations, a point in a scatter plot is considered an outlier when it has degree 1 in the minimum spanning tree and the length of the one edge from the point is greater than ω. Some implementations count the number of outliers, typically computed relative to the total number of points in the scatter plot, and weighted appropriately. For example, in some implementations, the outliers are measured as:

$\text{OutlyingMeasure} = \alpha \cdot \frac{\text{number of outliers}}{\text{total number of points}}$

where α is a scaling factor.

Some implementations compute a measure of outliers as the ratio of the edge length from outliers to the total edge length. Specifically:

$\text{OutlyingMeasure} = \frac{\operatorname{length}\left( T_{\text{outliers}} \right)}{\operatorname{length}(T)}$

where T_(outliers) is the set of edges connecting outliers to the rest of the minimum spanning tree.
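The two outlier measures described above can be sketched together as follows. The linear-interpolation percentile convention, the function names, and the default scaling factor of 1.0 are assumptions of this sketch; the disclosure does not specify them.

```python
import math
from collections import Counter


def percentile(sorted_values, p):
    """Linear-interpolation percentile (an assumed convention), with p in [0, 1]."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = p * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def outlying_measures(num_points, mst_edges, alpha=1.0):
    """Return (count-based measure, length-based measure).

    mst_edges: list of (length, u, v) edges of a minimum spanning tree.
    """
    if not mst_edges or num_points == 0:
        return 0.0, 0.0
    lengths = sorted(length for length, _, _ in mst_edges)
    q25, q75 = percentile(lengths, 0.25), percentile(lengths, 0.75)
    omega = q75 + 1.5 * (q75 - q25)

    degree = Counter()
    for _, u, v in mst_edges:
        degree[u] += 1
        degree[v] += 1

    outliers = set()
    outlier_length = 0.0
    for length, u, v in mst_edges:
        # An outlier is a degree-1 vertex whose single edge is longer than omega.
        leaves = [p for p in (u, v) if degree[p] == 1]
        if length > omega and leaves:
            outliers.update(leaves)
            outlier_length += length

    total_length = sum(lengths)
    by_count = alpha * len(outliers) / num_points
    by_length = outlier_length / total_length if total_length else 0.0
    return by_count, by_length
```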

Some implementations use alternative formulas for the various features that may be present in a scatter plot, and some implementations account for additional features such as shape (e.g., convex, skinny, stringy, or straight), trend (e.g., monotonic), density (e.g., skewed or clumpy), or coherence. Some of these implementations use formulas or methods described in “Graph-Theoretic Scagnostics,” L. Wilkinson et al., Proceedings of the IEEE Information Visualization 2005, pages 157-164, which is incorporated by reference herein in its entirety. Some implementations combine the individual feature measures as:

DataScore = 3·abs(PearsonsCorrelation) + 2·ClumpyMeasure + StriationMeasure + OutlyingMeasure.
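A short Python sketch of the weighted combination above, together with the per-pane summation mentioned earlier for partitioned scatter plots, is given below. The function names and the tuple layout for per-pane measures are assumptions of this sketch.

```python
def scatter_data_score(pearson, clumpy, striation, outlying):
    """DataScore = 3*|r| + 2*ClumpyMeasure + StriationMeasure + OutlyingMeasure."""
    return 3 * abs(pearson) + 2 * clumpy + striation + outlying


def overall_scatter_score(pane_measures):
    """Sum the DataScore over panes.

    pane_measures: iterable of (pearson, clumpy, striation, outlying) tuples.
    """
    return sum(scatter_data_score(*measures) for measures in pane_measures)
```

Because the correlation enters through its absolute value, strongly negative and strongly positive trends contribute equally.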

Aesthetically, scatter plots that fit completely on the screen are preferred. In addition, an overall square display is preferred (i.e., aspect ratio of 1). In some implementations, a LayoutScore is computed as:

if (scroll bars)
  ScrollPenalty = Value₁
else
  ScrollPenalty = 0.00
endif
if (AspectRatio > 1)
  LayoutScore = − ScrollPenalty − (AspectRatio − 1)
else
  LayoutScore = − ScrollPenalty − ((1 / AspectRatio) − 1)
endif

Note that in this example, the best possible layout score is zero.
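The scatter plot layout computation can be expressed as a small Python function, sketched below. The parameter scroll_penalty_value stands in for Value₁, whose magnitude is not given in the disclosure; the default of 1.0 is purely an assumption.

```python
def scatter_layout_score(aspect_ratio, has_scroll_bars, scroll_penalty_value=1.0):
    """LayoutScore for a scatter plot; zero is the best possible score.

    scroll_penalty_value stands in for Value_1 (an assumed default).
    """
    if aspect_ratio <= 0:
        raise ValueError("aspect_ratio must be positive")
    penalty = scroll_penalty_value if has_scroll_bars else 0.0
    if aspect_ratio > 1:
        return -penalty - (aspect_ratio - 1)
    return -penalty - ((1 / aspect_ratio) - 1)
```

The score is symmetric in the sense that an aspect ratio of 2 and an aspect ratio of 0.5 are penalized equally.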

Line Charts

Some implementations use simple measures of variability and overplotting in order to compute a DataScore for line charts. In some cases, using more complex formulas would be too time consuming. In some circumstances, line charts with high variability (e.g., spikes and troughs) are preferred (e.g., more interesting). However, in other circumstances, variability is disfavored. In some implementations, users may establish a line graph variability preference, or a variability preference may be inferred for specific data sets or data fields based on prior usage.

Some implementations measure variability of a line graph by forming a straight line through the first and last point in sequence (typically time), then summing up the differences between each intermediate point and the straight line. Some implementations use a partitioned result set to evaluate each visible line chart, and the variability scores for all the panes are added to compute an overall score. Some implementations use linear regression to fit the best line for each pane, then compare trends and variability based on those lines.

Some implementations compute an “overplotting” score, which penalizes data visualizations that include too many lines. In some implementations, the penalty is the excess over a specified threshold, such as five or ten. In some implementations, the penalty is the cardinality of the data field dimension that breaks up the view. Some implementations compute a more precise score using an image space histogram (e.g., using 2D binning of the image space).

Some implementations compute a VariabilityScore as:

$\text{VariabilityScore} = \sum_{i=1}^{n-1} \left| y_{i} - \left( m x_{i} + b \right) \right|$

where m=(y_(n)−y₀)/(x_(n)−x₀) is the slope of the line between the first and last points on the line chart, and b=y₀−mx₀ is the y-intercept of the line. Some implementations use other methods, such as linear regression, to identify the best line, then compute the variability score as above, but using all of the points on the line chart (including the first and last points).
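The endpoint-based variability computation can be sketched in Python as follows. The function name and the error handling for degenerate inputs are assumptions of this sketch.

```python
def variability_score(xs, ys):
    """Sum of |y_i - (m*x_i + b)| over the intermediate points of a line chart.

    The reference line passes through the first and last points.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    if xs[-1] == xs[0]:
        raise ValueError("first and last x values must differ")
    m = (ys[-1] - ys[0]) / (xs[-1] - xs[0])   # slope between the endpoints
    b = ys[0] - m * xs[0]                     # y-intercept of that line
    # Skip the first and last points, which lie on the line by construction.
    return sum(abs(y - (m * x + b)) for x, y in zip(xs[1:-1], ys[1:-1]))
```

For example, with xs = [0, 1, 2, 3, 4] and ys = [0, 5, 2, -1, 4], the reference line is y = x and the intermediate deviations are 4, 0, and 4, for a score of 8.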

As noted above, implementations use various formulas to compute an OverplottingScore. In some implementations, the OverplottingScore is just the total number of lines on the line chart, or the excess over a threshold number. Some implementations then combine these two scores using DataScore=VariabilityScore−OverplottingScore.
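A minimal Python sketch of the threshold-based overplotting penalty and the combined line chart DataScore follows. The function names and the default threshold of ten lines are assumptions of this sketch.

```python
def overplotting_score(num_lines, threshold=10):
    """Penalty equal to the number of lines in excess of a threshold (assumed default 10)."""
    return max(0, num_lines - threshold)


def line_data_score(variability, num_lines, threshold=10):
    """DataScore = VariabilityScore - OverplottingScore."""
    return variability - overplotting_score(num_lines, threshold)
```

Setting the threshold to zero reproduces the variant in which the penalty is simply the total number of lines.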

Like other view types, line charts that can be built completely on the screen are preferred. In addition, a vertical aspect ratio is preferable for line charts. In some implementations, a LayoutScore is computed as:

if (scroll bars)
  ScrollPenalty = Value₁
else
  ScrollPenalty = 0.00
endif
LayoutScore = AspectRatio − ScrollPenalty

Maps

Some implementations generate small multiples of filled maps as well as pie charts on maps. While both methods reveal structure in the data for different analytical tasks, filled maps are generally more effective than pie-maps when there is no prior knowledge of the user's task. Established preferences or historical information for the data fields selected can alter the default scoring. As usual, maps that fit on the screen and vertical aspect ratios are preferred. Some implementations compute the LayoutScore as:

if (scroll bars)
  ScrollPenalty = Value₁
else
  ScrollPenalty = 0.00
endif
LayoutScore = AspectRatio − ScrollPenalty

In some implementations, all computations to evaluate the views (e.g., to compute a DataScore and a LayoutScore) are done on the result set. That is, data values for the selected data fields are queried from the data source and no additional queries are used. Both the generation phase and the ranking phase require some computations on items in the result set. Some computations in the ranking phase may require a partitioned data set. Ordering of categories breaking down the view creates different sets of data points in each pane, which can produce data visualizations that are ranked differently (see, e.g., FIGS. 8A and 8B above).

In some implementations, the generation phase uses different builder or culling procedures for each of the different view types. For example, bar charts have different features than scatter plots. In some implementations, the generation phase uses simple techniques, such as changing the hierarchy of data fields used to specify the X-positions and Y-positions of graphical marks in potential data visualizations. For example, as illustrated above in FIGS. 8A and 8B, the selection of the innermost data field can make a cognitive difference for users.

In the generation phase, some implementations evaluate data visualization options that use small multiples (e.g., splitting the display into multiple panes, where each pane includes an appropriate subset of data). The small multiples are created by including additional data fields (e.g., categorical dimensions) in the definition of the X-positions and/or Y-positions.

For efficiency in the generation phase, some implementations perform certain common calculations first. For example, implementations typically compute the range of each measure (e.g., a quantitative data field) to determine whether it straddles zero. If so, the measure is inappropriate for encoding size. Implementations typically compute the spread of each measure to determine how the spread can be optimized visually on a display. For example, size encodings typically start the scale at zero. If the smallest value of a data field is too far from zero (relative to the spread of the variable), then the size variations would not be highly visible to the user. In that case, using a color encoding could be more effective because a full color spectrum can be aligned with the range of values of the data field.
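The range and spread checks described above might be sketched as follows. The function name, the returned labels, and the max_offset_ratio cutoff for "too far from zero" are hypothetical choices made for this sketch; the disclosure does not fix a specific threshold.

```python
def suggest_measure_encoding(values, max_offset_ratio=2.0):
    """Return "size" or "color" for a quantitative field.

    max_offset_ratio is an assumed cutoff: if the distance from zero to the
    nearest value exceeds this multiple of the spread, size differences
    would be hard to see.
    """
    lo, hi = min(values), max(values)
    if lo < 0 < hi:
        return "color"  # the measure straddles zero, so size is inappropriate
    spread = hi - lo
    distance_from_zero = min(abs(lo), abs(hi))
    if spread == 0 or distance_from_zero > max_offset_ratio * spread:
        return "color"  # values sit far from zero relative to their spread
    return "size"
```

Computing the minimum and maximum once per measure allows the same result to be reused by the builders for every view type.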

Some implementations evaluate the distribution of values for each selected data field (e.g., skewed versus uniform) to determine best encodings. For example, some implementations select a color palette that is appropriate for the distribution (e.g., a simple linear color palette for a uniform distribution, but a sequence of stepped colors to emphasize the divergent values in a skewed distribution). Evaluating the distribution of values is also useful in scatter plots and maps when measures are encoded as the size of the marks. For example, encoding the size based on the log of the data values may be more appropriate when the values are growing exponentially or according to a polynomial power curve.

Some implementations order measures so that the overall correlation, including the correlation between adjacent pairs of data fields, is maximized. The ordering of data fields is particularly useful for text tables and bar charts, as illustrated above in FIGS. 16A and 16B.
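One way to realize the ordering described above, offered only as an assumed interpretation, is to search all orderings of a small number of measures and keep the one that maximizes the sum of absolute correlations between adjacent fields. The function names and the brute-force search are choices of this sketch, which is practical only for a handful of fields.

```python
import itertools
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def order_measures(fields):
    """Order field names to maximize the summed |correlation| of adjacent pairs.

    fields: dict mapping field name -> list of values.
    """
    names = list(fields)
    corr = {
        (a, b): abs(pearson(fields[a], fields[b]))
        for a in names for b in names if a != b
    }

    def adjacent_total(order):
        return sum(corr[(a, b)] for a, b in zip(order, order[1:]))

    # Exhaustive search over permutations; acceptable only for few fields.
    return list(max(itertools.permutations(names), key=adjacent_total))
```

An ordering and its reverse always score the same, so either may be returned.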

Some implementations evaluate the order of rows or columns based on the values of a data field, and sort them accordingly (e.g., if the bars in a bar graph represent sales for each region, the bars may be ordered from least sales to greatest sales). In some implementations, when small multiples appear in separate panes, the panes may be ordered as well in order to better illustrate some characteristic of the data.

To limit the large number of potential data visualizations, some implementations track which data visualizations have been previously identified and thus prevent repetition. Some implementations use a ranking log 234, either by itself, or in conjunction with a data visualization history log 232, which were described above with respect to FIGS. 14 and 15. In some implementations, this prevents duplication within a single generation phase. In other implementations, some or all of the generated options are tracked so that they are omitted (or downgraded) in a later generation phase.
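The duplicate-tracking idea can be illustrated with a hashable description of each option and a set that plays the role of a log. The VizOption fields and the use of a plain Python set are hypothetical stand-ins for this sketch and do not describe the actual structure of the ranking log 234.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VizOption:
    """Hypothetical, hashable description of a data visualization option."""
    view_type: str
    x_fields: tuple
    y_fields: tuple
    color_field: str = ""


def filter_new_options(candidates, seen_log):
    """Return candidates not already in seen_log, recording them as seen.

    seen_log: a set of VizOption (a stand-in for a persistent log).
    """
    fresh = []
    for option in candidates:
        if option in seen_log:
            continue  # omit options identified earlier
        seen_log.add(option)
        fresh.append(option)
    return fresh
```

Reusing the same set across generation phases omits previously generated options; a variant could lower their scores instead of dropping them.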

In some instances, a user has already constructed a data visualization based on a set of data, and has already selected how that data is used (e.g., what data fields specify X-positions and Y-positions of graphical marks, what data fields are used for color or size encoding, etc.). The user may then seek alternative visualizations of the same data, potentially with a different view type. In this situation, implementations typically track what the user previously selected and give greater weight to data visualization options that preserve as many of the user selections as possible. For example, if the user previously selected a certain data field for color encoding, then preserving that color encoding is preferred.

As noted above, some scoring aspects are shared across different view types. For example, preferences for fitting an entire data visualization on the screen and a vertical aspect ratio are commonly used. Computing these shared aspects at the outset increases efficiency by avoiding duplicate calculations. In addition, some of the view types prefer visual chunks that have cardinality near five, such as in tables and bar charts. Shared functionality is typically implemented in functions, procedures, or methods that can be used by the ranking functions for each view type.

Some ranking criteria require partitioning of the underlying data. For example, some implementations use partitioning to evaluate the “shape” of the data. In some implementations, data in each pane of a scatter plot view is used to compute the correlation, clumpiness, striation, and number of outliers, and the scores are combined. Some implementations also partition the data to evaluate the variability of the data in a line chart. In each pane of a line chart, the ranking process computes the deviation from a simple linear fit.

Some implementations incorporate various mechanisms to ensure that the generation and ranking phases remain responsive even for very large data sets. Some implementations limit the full generation and ranking process to cases where there is a relatively small set of selected data fields (e.g., not exceeding a predefined threshold number of fields). When the selected number of data fields exceeds that threshold, some implementations display an informational message to the user. In some implementations, when there are too many fields, various subsets are selected and data visualizations are generated for those subsets. As noted earlier, subsets are typically selected based on semantic relatedness of the data fields in the subset. In some implementations, user preferences or historical selections of data visualizations are used to guide a more limited generation process. Some implementations use data visualization options that have been previously generated and ranked, even if not previously presented or selected. Some implementations set a time limit on how quickly the ranked list must be provided to the user, and present the list at that time based on whatever options have been evaluated. When a time limit is imposed, some implementations generate the options based on heuristics of what views are most likely to be the best and/or most likely to be selected by the user. That is, the more likely options are generated and evaluated first.
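The time-limited ranking behavior can be sketched as a loop that scores options in heuristic order and stops at a deadline. The function name, the injectable clock parameter (included so the behavior can be tested deterministically), and the choice to always score at least one option are assumptions of this sketch.

```python
import time


def rank_with_time_limit(options, score_fn, time_limit_s, clock=time.monotonic):
    """Score options in the given (most-likely-first) order until time runs out.

    Returns the options evaluated so far, ordered from highest to lowest score.
    """
    deadline = clock() + time_limit_s
    scored = []
    for option in options:
        # Assumption: always evaluate at least one option before giving up.
        if scored and clock() >= deadline:
            break
        scored.append((score_fn(option), option))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [option for _, option in scored]
```

Because `options` may be a lazy generator, options that are never reached are never built, which matches the idea of generating the more likely options first.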

Because aggregated values from a result set depend on the level of detail of the user selected fields, implementations typically cannot precompute correlation or other scores on the raw data.

Some implementations provide multiple alternative views for a single view type. In some implementations, the alternative views are essentially subtypes of a basic view type, such as normal bars, stacked bars, and clustered bars within the bar graph view type.

Some implementations enable a user to select a single view type, and generate data visualization options within that one view type. In some implementations, the selected view type includes two or more subtypes. In some implementations, the user is presented with a palette of view type options and can select the desired view types (or all). In some implementations, a user may select specific subtypes as well (e.g., only bar charts that are stacked).

Some implementations expand or build on techniques described in U.S. Pat. No. 8,099,674, entitled “Computer Systems and Methods for Automatically Viewing Multidimensional Databases,” which has been incorporated herein by reference in its entirety. Some implementations expand or build on techniques described in U.S. patent application Ser. No. 12/214,818, entitled “Methods and Systems of Automatically Generating Marks in a Graphical View,” which has also been incorporated herein by reference in its entirety. Some implementations expand or build on techniques described in “Show Me: Automatic Presentation for Visual Analysis,” Mackinlay, Jock, et al., IEEE Transactions on Visualization and Computer Graphics, Vol. 13, No. 6, November/December 2007, which is incorporated herein by reference in its entirety.

The terminology used in the description of the invention herein is for the purpose of describing particular implementations only and is not intended to be limiting of the invention. As used in the description and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof.

The foregoing description has focused on certain view types, but the same or similar techniques can be applied to many other view types as well, including highlight tables, heat maps, area charts, circle plots, treemaps, pie charts, bubble charts, Gantt charts, box plots, and bullet graphs.

The foregoing description, for purpose of explanation, has been described with reference to specific implementations. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The implementations were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various implementations with various modifications as are suited to the particular use contemplated.

What is claimed is:
1. A method of ranking data visualization options, comprising: at a computing device having one or more processors and memory, wherein the memory stores one or more programs for execution by the one or more processors: receiving user selection of a plurality of data fields from a data set; generating a plurality of data visualization options that use a majority of the plurality of data fields; computing, for each data visualization option of the plurality of data visualization options, a respective score for the respective data visualization option according to a set of ranking criteria, the set of ranking criteria including a first ranking criterion that is based on values of one or more of the user-selected data fields in the data set; creating a ranked list of the data visualization options, wherein the ranked list is ordered according to a plurality of computed scores corresponding to the plurality of data visualization options; and presenting the ranked list to the user.
2. The method of claim 1, further comprising: receiving user selection of a first data visualization option from the ranked list; and in response to the user selection of the first data visualization option, displaying a data visualization corresponding to the first data visualization option.
3. The method of claim 1, wherein the computation of scores for one or more of the data visualization options uses historical data of data visualizations previously created for the data set.
4. The method of claim 1, wherein the computation of scores for one or more of the data visualization options uses historical data of data visualizations previously selected by the user.
5. The method of claim 1, wherein the computation of scores for one or more of the data visualization options uses a set of user preferences for the user.
6. The method of claim 1, further comprising: receiving user selection of a filter that applies to a first user-selected data field; wherein the filter identifies a set of values for the first user-selected data field, and wherein the plurality of data visualization options are based on limiting values of the first user-selected data field to the set of values.
7. The method of claim 6, wherein the set of values is one of: a finite set of discrete values; or an interval of numeric values.
8. The method of claim 1, wherein the set of ranking criteria includes a second ranking criterion that scores each data visualization option of the plurality of data visualization options according to visual structure of values of one or more of the user-selected data fields as rendered in the respective data visualization option.
9. The method of claim 8, wherein visual structure includes one or more of: clustering of data points; presence of outliers; and monotonicity of rendered data points.
10. The method of claim 1, wherein the set of ranking criteria includes a second ranking criterion that scores each respective data visualization option of the plurality of data visualization options according to one or more aesthetic qualities of the respective data visualization option as rendered using a plurality of data values stored in one or more of the plurality of data fields.
11. The method of claim 10, wherein the one or more aesthetic qualities include at least one of: an aspect ratio of a rendered data visualization corresponding to the respective data visualization option; and an extent to which a rendered data visualization corresponding to the respective data visualization option can be displayed in its entirety on a user screen at one time in a human readable format.
12. The method of claim 1, wherein each of the data visualization options has a view type selected from the group consisting of text table, bar chart, scatter plot, line graph, and map.
13. The method of claim 1, wherein generating the plurality of data visualization options further comprises: identifying a first set of one or more data visualization options previously presented to the user and not selected by the user; and not including the first set of data visualization options in the generated data visualization options.
14. The method of claim 1, wherein generating the plurality of data visualization options further comprises: identifying a first user-selected data field; selecting a color palette for encoding values of the first user-selected data field based on an identified distribution of values of the first user-selected data field; and limiting the generation to data visualization options that use the selected color palette for encoding the first user-selected data field.
15. A computer system for ranking data visualization options, comprising: one or more processors; memory; and one or more programs stored in the memory for execution by the one or more processors, the one or more programs comprising instructions for: receiving user selection of a plurality of data fields from a data set; generating a plurality of data visualization options that use a majority of the plurality of data fields; computing, for each data visualization option of the plurality of data visualization options, a respective score for the respective data visualization option according to a set of ranking criteria, the set of ranking criteria including a first ranking criterion that is based on values of one or more of the user-selected data fields in the data set; creating a ranked list of the data visualization options, wherein the ranked list is ordered according to a plurality of computed scores corresponding to the plurality of data visualization options; and presenting the ranked list to the user.
16. The computer system of claim 15, wherein the one or more programs further comprise instructions for: receiving user selection of a first data visualization option from the ranked list; and in response to the user selection of the first data visualization option, displaying a data visualization corresponding to the first data visualization option.
17. The computer system of claim 15, wherein the one or more programs further comprise instructions for: receiving user selection of a filter that applies to a first user-selected data field; wherein the filter identifies a set of values for the first user-selected data field, and wherein the plurality of data visualization options are based on limiting values of the first user-selected data field to the set of values.
18. The computer system of claim 15, wherein the instructions for generating the plurality of data visualization options further include instructions for: identifying a first set of one or more data visualization options previously presented to the user and not selected by the user; and not including the first set of data visualization options in the generated data visualization options.
19. The computer system of claim 15, wherein the instructions for generating the plurality of data visualization options further include instructions for: identifying a first user-selected data field; selecting a color palette for encoding values of the first user-selected data field based on an identified distribution of values of the first user-selected data field; and limiting the generation to data visualization options that use the selected color palette for encoding the first user-selected data field.
20. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs comprising instructions for: receiving user selection of a plurality of data fields from a data set; generating a plurality of data visualization options that use a majority of the plurality of data fields; computing, for each data visualization option of the plurality of data visualization options, a respective score for the respective data visualization option according to a set of ranking criteria, the set of ranking criteria including a first ranking criterion that is based on values of one or more of the user-selected data fields in the data set; creating a ranked list of the data visualization options, wherein the ranked list is ordered according to a plurality of computed scores corresponding to the plurality of data visualization options; and presenting the ranked list to the user.